# Hacka (midterm) thon 

## Detecting Malicious URLs 

Today you are invited to retrace the work of the researchers behind *Detecting Malicious URLs*.
The data is an anonymized 120-day subset of the authors' ICML-09 data set.
The data set consists of about 2.4 million URLs (examples) and 3.2 million features.

#### 1. Download the data using the link below
[Download Dataset](http://www.sysnet.ucsd.edu/projects/url/url_svmlight.tar.gz)

#### 2. Description of Data (SVM-light)
Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:

1. **FeatureTypes**. A text file listing the indices of the real-valued features.
2. **DayX.svm** (where X is an integer from 0 to 120). The data for day X in SVM-light format. A label of +1 corresponds to a malicious URL and -1 corresponds to a benign URL.
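
To make the SVM-light layout concrete, here is a minimal sketch of how one line breaks down into a label and sparse `index:value` pairs. The helper name `parse_svmlight_line` and the sample line are assumptions made up for illustration; in the notebook itself, `load_svmlight_file` from scikit-learn does this work.

```python
def parse_svmlight_line(line):
    """Parse one SVM-light line into (label, {feature_index: value})."""
    # Anything after '#' is treated as a comment in the SVM-light format.
    body = line.split("#", 1)[0].split()
    label = int(float(body[0]))  # +1 = malicious URL, -1 = benign URL
    features = {}
    for pair in body[1:]:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line, not taken from the dataset.
label, features = parse_svmlight_line("+1 4:0.0788382 5:0.124138 11:1 # note")
print(label, features)
```

Only the non-zero features appear on a line, which is why a matrix with millions of columns still fits in memory as a sparse matrix.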


#### 3. Read article
Please familiarize yourself with the original research article. It will give you the required context.

*"**Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs**"* 

*Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker* 

## Demo part

#### 1. Load data

In [1]:
import glob
import matplotlib.pyplot as plt
from sklearn.datasets import load_svmlight_file
files = glob.glob('./url_svmlight/*.svm')
print("There are %d files" % len(files))
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

There are 123 files


#### 2. What is inside

In [2]:
import tarfile
from sklearn.datasets import load_svmlight_file
import numpy as np

In [3]:
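uri = './url_svmlight.tar.gz'
tar = tarfile.open(uri, "r:gz")
max_rows = 0      # largest number of URLs (rows) in a single day file
max_features = 0  # largest number of features (columns) seen so far
i = 0
split = 5
for tarinfo in tar:
    print("extracting %s,f size %s" % (tarinfo.name, tarinfo.size))
    if tarinfo.isfile():
        # Read the member straight from the archive, without unpacking it to disk.
        f = tar.extractfile(tarinfo.name)
        X, y = load_svmlight_file(f)
        max_rows = max(max_rows, X.shape[0])
        max_features = max(max_features, X.shape[1])
    if i > split:
        break
    i += 1
# "max X" is the feature count; "max y dimension" is the number of rows per file.
print("max X = %s, max y dimension = %s" % (max_features, max_rows))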

extracting url_svmlight,f size 0
extracting url_svmlight/Day33.svm,f size 18674876
extracting url_svmlight/Day32.svm,f size 18599211
extracting url_svmlight/Day53.svm,f size 18963938
extracting url_svmlight/Day20.svm,f size 18633460
extracting url_svmlight/Day7.svm,f size 18777054
extracting url_svmlight/Day117.svm,f size 18106370
max X = 3231952, max y dimension = 20000


#### 3. Train a baseline model

In [4]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

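classes = [-1, 1]  # 1: malicious URL, -1: benign URL
sgd = SGDClassifier(loss='log')  # logistic loss; spelled 'log_loss' in scikit-learn 1.1 and later
n_features = 3231952
split = 5
i = 0
for tarinfo in tar:
    if i > split:
        break
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X, y = load_svmlight_file(f, n_features=n_features)
        if i < split:
            # Train incrementally on the first files.
            sgd.partial_fit(X, y, classes=classes)
        if i == split:
            # Evaluate on the next file. Predictions are passed first here, whereas
            # classification_report expects (y_true, y_pred), so precision and recall are swapped.
            print(classification_report(sgd.predict(X), y))
    i += 1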

              precision    recall  f1-score   support

          -1       0.99      0.97      0.98     14590
           1       0.92      0.98      0.95      5410

    accuracy                           0.97     20000
   macro avg       0.96      0.97      0.96     20000
weighted avg       0.97      0.97      0.97     20000
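
In the demo cell above, the predictions are passed to `classification_report` first, while scikit-learn expects the true labels first and the predictions second. The sketch below uses made-up labels and a hypothetical helper `precision_recall` to show why the order matters: swapping the two arguments swaps precision and recall for a class.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted = sum(1 for p in y_pred if p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return tp / predicted, tp / actual


# Toy labels, made up for illustration (+1 = malicious, -1 = benign).
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, 1, -1, -1, -1]

correct_order = precision_recall(y_true, y_pred)  # (precision, recall)
swapped_order = precision_recall(y_pred, y_true)  # the two values trade places
print(correct_order, swapped_order)
```

The F1 score and accuracy are unaffected by the swap, but the support column then counts predicted labels rather than true ones.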



## Midterm (Part 2)

### Grading criteria
- Complete solution - 60%
- F1 Score - 40%
    - The first 10 results get 40%
    - The worst result gets 20%
    - All others are scaled between these two

### Deadline
20:00 MSK, April 4

#### 1. Train, test
- Load the data (you can use the template above)
- Separate your dataset into train and test subsets of observations
- Use the 8:2 ratio: 80% train set, 20% test set
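
One way to get an 80/20 split that keeps the share of malicious and benign URLs the same in both subsets is a stratified shuffle. The sketch below works on in-memory SVM-light lines; the function name `stratified_split`, the fixed seed, and the toy lines are assumptions for illustration, not a required approach.

```python
import random


def stratified_split(lines, train_size=0.8, seed=0):
    """Shuffle lines within each label, then cut each group at train_size."""
    rng = random.Random(seed)
    by_label = {}
    for line in lines:
        # The first token of an SVM-light line is the label.
        by_label.setdefault(line.split()[0], []).append(line)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(train_size * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: 10 benign (-1) and 5 malicious (+1) lines.
lines = ["-1 1:1"] * 10 + ["+1 2:1"] * 5
train, test = stratified_split(lines)
print(len(train), len(test))
```

For the full dataset the same idea can be applied file by file, so that no more than one day of data is held in memory at a time.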

In [5]:
from random import random

def train_test_split(train_size=0.8):
    """Split each DayX.svm file into train.svm and test.svm, keeping the class ratio."""
    summ = 0
    with open('./url_svmlight/train.svm', 'w') as train, \
         open('./url_svmlight/test.svm', 'w') as test:
        for i in range(121):
            path = './url_svmlight/Day' + str(i) + '.svm'
            print("file: " + str(i))

            # First pass: count benign (-1) and malicious (+1) examples.
            q = [0, 0]
            with open(path, 'r') as inn:
                for line in inn:
                    q[0 if line.split()[0] == "-1" else 1] += 1
            summ += q[0] + q[1]

            # Per-class train quota, rounded half up.
            check = [int(train_size * n + 0.5) for n in q]
            start = [0, 0]

            # Second pass: send each line to train or test until the quotas are met.
            with open(path, 'r') as inn:
                for line in inn:
                    c = 0 if line.split()[0] == "-1" else 1
                    if random() > 0.5:
                        to_train = start[c] < check[c]
                    else:
                        to_train = q[c] <= check[c]
                    # Lines already end with a newline; add one only if it is missing.
                    if not line.endswith("\n"):
                        line += "\n"
                    if to_train:
                        train.write(line)
                        start[c] += 1
                    else:
                        test.write(line)
                        q[c] -= 1
            print("finish: " + str(i))
    print(summ)

In [6]:
from sklearn.datasets import load_svmlight_file
import numpy as np

train_test_split()

# Fix the feature count so that the train and test matrices have the same width.
n_features = 3231961

data, target = load_svmlight_file("./url_svmlight/train.svm", n_features=n_features)

file: 0
finish: 0
file: 1
finish: 1
file: 2
finish: 2
file: 3
finish: 3
file: 4
finish: 4
file: 5
finish: 5
file: 6
finish: 6
file: 7
finish: 7
file: 8
finish: 8
file: 9
finish: 9
file: 10
finish: 10
file: 11
finish: 11
file: 12
finish: 12
file: 13
finish: 13
file: 14
finish: 14
file: 15
finish: 15
file: 16
finish: 16
file: 17
finish: 17
file: 18
finish: 18
file: 19
finish: 19
file: 20
finish: 20
file: 21
finish: 21
file: 22
finish: 22
file: 23
finish: 23
file: 24
finish: 24
file: 25
finish: 25
file: 26
finish: 26
file: 27
finish: 27
file: 28
finish: 28
file: 29
finish: 29
file: 30
finish: 30
file: 31
finish: 31
file: 32
finish: 32
file: 33
finish: 33
file: 34
finish: 34
file: 35
finish: 35
file: 36
finish: 36
file: 37
finish: 37
file: 38
finish: 38
file: 39
finish: 39
file: 40
finish: 40
file: 41
finish: 41
file: 42
finish: 42
file: 43
finish: 43
file: 44
finish: 44
file: 45
finish: 45
file: 46
finish: 46
file: 47
finish: 47
file: 48
finish: 48
file: 49
finish: 49
file: 50
finish: 50


In [7]:
print(data)

  (0, 1)	1.0
  (0, 3)	0.0912863
  (0, 4)	0.144828
  (0, 5)	0.117647
  (0, 9)	1.0
  (0, 10)	0.142857
  (0, 16)	0.760482
  (0, 17)	0.820882
  (0, 18)	0.150678
  (0, 20)	0.142856
  (0, 21)	0.142857
  (0, 23)	1.0
  (0, 27)	1.0
  (0, 32)	0.111111
  (0, 43)	1.0
  (0, 53)	1.0
  (0, 55)	1.0
  (0, 61)	1.0
  (0, 63)	1.0
  (0, 65)	1.0
  (0, 67)	1.0
  (0, 69)	1.0
  (0, 71)	1.0
  (0, 73)	1.0
  (0, 75)	1.0
  :	:
  (1916903, 155178)	1.0
  (1916903, 155179)	1.0
  (1916903, 155180)	1.0
  (1916903, 155181)	1.0
  (1916903, 155182)	1.0
  (1916903, 155193)	1.0
  (1916903, 155194)	1.0
  (1916903, 155195)	1.0
  (1916903, 155196)	1.0
  (1916903, 155197)	1.0
  (1916903, 155198)	1.0
  (1916903, 155199)	1.0
  (1916903, 155200)	1.0
  (1916903, 155201)	1.0
  (1916903, 155202)	1.0
  (1916903, 155203)	1.0
  (1916903, 155204)	1.0
  (1916903, 155205)	1.0
  (1916903, 155206)	1.0
  (1916903, 155207)	1.0
  (1916903, 155208)	1.0
  (1916903, 155209)	1.0
  (1916903, 155210)	1.0
  (1916903, 155211)	1.0
  (1916903, 155212)	1.

In [8]:
print(target)

[-1. -1. -1. ...  1. -1. -1.]


#### 2. Find out whether it is possible to reduce the dimensionality

In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix

In [10]:
#data = StandardScaler(with_mean=False).fit_transform(data)

In [11]:
#print(data)

In [12]:
# load_svmlight_file already returns a SciPy CSR matrix, so this only makes a copy.
data_sparse = csr_matrix(data)

In [13]:
print(data_sparse)

  (0, 1)	1.0
  (0, 3)	0.0912863
  (0, 4)	0.144828
  (0, 5)	0.117647
  (0, 9)	1.0
  (0, 10)	0.142857
  (0, 16)	0.760482
  (0, 17)	0.820882
  (0, 18)	0.150678
  (0, 20)	0.142856
  (0, 21)	0.142857
  (0, 23)	1.0
  (0, 27)	1.0
  (0, 32)	0.111111
  (0, 43)	1.0
  (0, 53)	1.0
  (0, 55)	1.0
  (0, 61)	1.0
  (0, 63)	1.0
  (0, 65)	1.0
  (0, 67)	1.0
  (0, 69)	1.0
  (0, 71)	1.0
  (0, 73)	1.0
  (0, 75)	1.0
  :	:
  (1916903, 155178)	1.0
  (1916903, 155179)	1.0
  (1916903, 155180)	1.0
  (1916903, 155181)	1.0
  (1916903, 155182)	1.0
  (1916903, 155193)	1.0
  (1916903, 155194)	1.0
  (1916903, 155195)	1.0
  (1916903, 155196)	1.0
  (1916903, 155197)	1.0
  (1916903, 155198)	1.0
  (1916903, 155199)	1.0
  (1916903, 155200)	1.0
  (1916903, 155201)	1.0
  (1916903, 155202)	1.0
  (1916903, 155203)	1.0
  (1916903, 155204)	1.0
  (1916903, 155205)	1.0
  (1916903, 155206)	1.0
  (1916903, 155207)	1.0
  (1916903, 155208)	1.0
  (1916903, 155209)	1.0
  (1916903, 155210)	1.0
  (1916903, 155211)	1.0
  (1916903, 155212)	1.

In [14]:
tsvd = TruncatedSVD(n_components=80)


In [15]:
data_sparse_tsvd = tsvd.fit(data_sparse).transform(data_sparse)

In [16]:
# This refits the same tsvd object on data; later tsvd.transform(...) calls use this second fit.
data_tsvd = tsvd.fit(data).transform(data)

In [17]:
print("Original number of sparse features:", data_sparse.shape[1])
print("Original number of features:", data.shape[1])
print("Reduced number of sparse features:", data_sparse_tsvd.shape[1])
print("Reduced number of features:", data_tsvd.shape[1])

Original number of sparse features: 3231961
Original number of features: 3231961
Reduced number of sparse features: 80
Reduced number of features: 80
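
A common way to judge whether the reduction keeps enough information is to look at how much variance the retained components explain. Below is a minimal sketch on a small random sparse matrix (a stand-in for `data`, not the real dataset). `explained_variance_ratio_` is an attribute of a fitted `TruncatedSVD`; the matrix size, `n_components=10`, and the fixed `random_state` are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for the URL matrix: 200 rows, 50 columns, 10% non-zero entries.
X_demo = sparse_random(200, 50, density=0.1, format="csr", random_state=0)

svd_demo = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd_demo.fit_transform(X_demo)

# Cumulative share of variance kept by the first k components.
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)
print(X_reduced.shape)
print(cumulative[-1])
```

If the cumulative share is still low at the chosen number of components, the reduced features are likely to lose information that a classifier could have used.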


#### 3. Create a model

In [18]:
from sklearn.linear_model import PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier()
pac.fit(data,target)

PassiveAggressiveClassifier()

#### 4. Evaluate the quality
- precision
- recall
- f1-score
- support 
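
For reference, the F1 score is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$, and support is the number of examples in each class. A small sketch of the formula follows; the helper name `f1_score_from` is an assumption, and the inputs are a sample precision/recall pair of the kind `classification_report` prints.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sample inputs: precision 0.991662 and recall 0.983844.
f1 = f1_score_from(0.991662, 0.983844)
print(round(f1, 4))
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot score well by maximizing only precision or only recall.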

In [19]:
from sklearn.metrics import classification_report
test_data,test_target = load_svmlight_file("./url_svmlight/test.svm",n_features=n_features)
# Accuracy before dimension reduction.
# Note: predictions are passed first here, so the support column counts predicted labels.
print(classification_report(pac.predict(test_data), test_target, digits=6))

              precision    recall  f1-score   support

        -1.0   0.991957  0.995866  0.993908    319536
         1.0   0.991662  0.983844  0.987737    159690

    accuracy                       0.991860    479226
   macro avg   0.991810  0.989855  0.990823    479226
weighted avg   0.991859  0.991860  0.991852    479226



In [20]:
# Dimension reduction without the csr_matrix conversion
pac.fit(data_tsvd,target)

PassiveAggressiveClassifier()

In [21]:
test_data_tsvd = tsvd.transform(test_data)

In [22]:
print (classification_report(pac.predict(test_data_tsvd),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.961870  0.980483  0.971087    314705
         1.0   0.961232  0.925651  0.943106    164521

    accuracy                       0.961659    479226
   macro avg   0.961551  0.953067  0.957097    479226
weighted avg   0.961651  0.961659  0.961481    479226



In [23]:
# Dimension reduction with the csr_matrix conversion
pac.fit(data_sparse_tsvd,target)

PassiveAggressiveClassifier()

In [24]:
#test_data = StandardScaler(with_mean=False).fit_transform(test_data)
test_data_sparse = csr_matrix(test_data)

In [25]:
test_data_sparse_tsvd = tsvd.transform(test_data_sparse)

In [26]:
# Accuracy after dimension reduction with the csr_matrix conversion


In [27]:
#1-PassiveAggressiveClassifier
print (classification_report(pac.predict(test_data_sparse_tsvd),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.983307  0.947908  0.965283    332775
         1.0   0.890583  0.963435  0.925578    146451

    accuracy                       0.952653    479226
   macro avg   0.936945  0.955671  0.945430    479226
weighted avg   0.954971  0.952653  0.953149    479226



In [28]:
# With PassiveAggressiveClassifier, accuracy after dimension reduction is about 95% on the
# csr_matrix copy and about 96% on the original matrix. load_svmlight_file already returns a
# sparse CSR matrix, so the two inputs are identical; the gap may instead come from In [16]
# refitting tsvd, which means the In [25] transform does not match the In [15] training features.
# So I will use the reduction fitted on the original matrix (data_tsvd) to compare classifiers.
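
One thing to keep in mind when comparing reduced-feature runs: `TruncatedSVD` uses a randomized solver by default, so two separate fits can yield different components, and test data should be transformed by the same fitted object that produced the training features. A hedged sketch of one way to guarantee that is to chain the reducer and the classifier with `make_pipeline`. The toy data, `n_components=5`, and the choice of `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
# Toy stand-in for the URL features: 300 rows, 40 columns, mostly zeros.
dense = rng.rand(300, 40) * (rng.rand(300, 40) < 0.2)
labels = np.where(dense[:, 0] + dense[:, 1] > 0, 1, -1)
X_toy = csr_matrix(dense)

X_train, X_test = X_toy[:240], X_toy[240:]
y_train, y_test = labels[:240], labels[240:]

# The pipeline fits TruncatedSVD once on the training rows and reuses
# that same fit when transforming the test rows inside predict().
model = make_pipeline(
    TruncatedSVD(n_components=5, random_state=0),
    LogisticRegression(max_iter=200),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

With this layout there is no separate `tsvd` object to refit by accident between training and evaluation.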

In [29]:
#2-Decision tree

from sklearn.tree import DecisionTreeClassifier
model_tree=DecisionTreeClassifier(max_depth=25)

In [30]:
#without using dimension reduction
model_tree.fit(data,target)
print (classification_report(model_tree.predict(test_data),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.993161  0.992186  0.992673    321110
         1.0   0.984163  0.986124  0.985143    158116

    accuracy                       0.990186    479226
   macro avg   0.988662  0.989155  0.988908    479226
weighted avg   0.990192  0.990186  0.990189    479226



In [31]:
#with using dimension reduction
model_tree.fit(data_tsvd, target)
print (classification_report(model_tree.predict(test_data_tsvd),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.986839  0.984274  0.985555    321631
         1.0   0.968074  0.973210  0.970635    157595

    accuracy                       0.980635    479226
   macro avg   0.977457  0.978742  0.978095    479226
weighted avg   0.980668  0.980635  0.980648    479226



In [32]:
#3-Logistic regression
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(max_iter=50)

In [33]:
#without using dimension reduction
model_lr.fit(data, target)
print (classification_report(model_lr.predict(test_data),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.985686  0.987628  0.986656    320164
         1.0   0.974999  0.971131  0.973061    159062

    accuracy                       0.982152    479226
   macro avg   0.980342  0.979379  0.979858    479226
weighted avg   0.982138  0.982152  0.982144    479226



In [34]:
#with using dimension reduction
model_lr.fit(data_tsvd, target)
print (classification_report(model_lr.predict(test_data_tsvd),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.972366  0.977546  0.974949    319095
         1.0   0.954775  0.944639  0.949680    160131

    accuracy                       0.966550    479226
   macro avg   0.963570  0.961092  0.962314    479226
weighted avg   0.966488  0.966550  0.966505    479226



In [35]:
#4-Random Forest
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(max_depth=25)

In [36]:
#without using dimension reduction
model_rf.fit(data, target)
print (classification_report(model_rf.predict(test_data),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.999857  0.752039  0.858420    426506
         1.0   0.332473  0.999127  0.498923     52720

    accuracy                       0.779221    479226
   macro avg   0.666165  0.875583  0.678671    479226
weighted avg   0.926437  0.779221  0.818871    479226



In [37]:
#with using dimension reduction
model_rf.fit(data_tsvd, target)
print (classification_report(model_rf.predict(test_data_tsvd),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.990365  0.988359  0.989361    321446
         1.0   0.976381  0.980409  0.978391    157780

    accuracy                       0.985742    479226
   macro avg   0.983373  0.984384  0.983876    479226
weighted avg   0.985761  0.985742  0.985749    479226



In [38]:
#5-Support Vector Classification
from sklearn.svm import SVC
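# max_iter=50 caps the solver at 50 iterations, so on data this size it will likely stop before converging.
model_svc = SVC(max_iter=50)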

In [39]:
#without using dimension reduction
model_svc.fit(data, target)
print (classification_report( model_svc.predict(test_data),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.907505  0.876169  0.891562    332268
         1.0   0.740297  0.798092  0.768109    146958

    accuracy                       0.852226    479226
   macro avg   0.823901  0.837131  0.829835    479226
weighted avg   0.856229  0.852226  0.853704    479226



In [40]:
#with using dimension reduction
model_svc.fit(data_tsvd, target)
print (classification_report( model_svc.predict(test_data_tsvd),test_target, digits = 6))

              precision    recall  f1-score   support

        -1.0   0.559862  0.767031  0.647274    234151
         1.0   0.655686  0.423874  0.514892    245075

    accuracy                       0.591541    479226
   macro avg   0.607774  0.595453  0.581083    479226
weighted avg   0.608866  0.591541  0.579574    479226



In [43]:
# Report (accuracy)
#             | 1-PassiveAggressiveClassifier | 2-Decision tree | 3-Logistic regression | 4-Random forest | 5-SVC
# without DR  | 99%                           | 99%             | 98%                   | 77.9%           | 85%
# with DR     | 96%                           | 98%             | 96.7%                 | 98.6%           | 59%

In [44]:
# The decision tree gives one of the best accuracies:
# 99% without dimension reduction and 98% with it.


In [49]:
# Dimension reduction helped a lot with random forest, raising accuracy from 77.9% to 98.6%.
# With the other classifiers it lowered accuracy, by about 1% for the decision tree and by more for the rest.
# So for this dataset dimension reduction is not needed with most classifiers;
# it pays off only with the random forest classifier.