# Model Comparisons

This notebook compares model performance across several datasets. The table below summarizes the properties of each dataset. The CSV files are stored in the [Data folder](https://drive.google.com/drive/folders/1eImejP0Yh5Wf0pd1PAfwiVDReUCgM45a).

### Unbalanced vs. balanced data
| __Data files__   |  __# of Data points__ | __Features without DNS data__ |__Trusted / Untrusted Balance__| 
|:------:|:------:|:------:|:------:|
| `domain_40k_features.csv`| 40k | API records + Social Links | 90%-10% (Unbalanced) | 
| `domain_46k_features.csv` | 46k | API records + Social Links| 50%-50% (Balanced)| 
| `domain_130k_features.csv` | 130k | API records + Social Links | 90%-10% (Unbalanced)| 

### We demonstrate the approaches below to handle the class imbalance and further improve our model.

- Up-sampling
- Down-sampling
- One-hot encoding

### Six sets of models are created in total:
Note that after removing the 'pending' labels, the 130k dataset shrinks to about 60k rows (61,346).

| __Feature sets__ |__Balanced?__| __LR Models__ |__LR f1-scores__|__RF Models__|__RF f1-scores__| 
|:------:|:------:|:------:|:------:|:------:|:------:|
| 40k |No| lr1 | trusted: 0.96, untrusted: 0 | rf1 | trusted: 0.96, untrusted: 0| 
| 46k |Yes| lr2 | trusted: 0.89, untrusted: 0.89 | rf2 | trusted: 0.90, untrusted: 0.90| 
| 60k |No| lr3 | trusted: 0.95, untrusted: 0 | rf3 | trusted: 0.95, untrusted: 0| 
| Down-sampled |Yes| lr4 | trusted: 0.74, untrusted: 0.76 | rf4 | trusted: 0.74, untrusted: 0.77| 
| Up-sampled |Yes| lr5 | trusted: 0.74, untrusted: 0.76 | rf5 | trusted: 0.76, untrusted: 0.79| 
| 46k + One Hot Encoding |Yes| lr6 | trusted: 0.91, untrusted: 0.92 | rf6 | trusted: 0.91, untrusted: 0.92|

As the table above shows, model set 6 is trained on balanced data and has the highest f1-scores for both classes.
#### Hence, we achieve an f1-score of about 0.92 (0.91 trusted, 0.92 untrusted) with a balanced model, without using any DNS data. More details below:

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import ast

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample

In [2]:
PATH = Path("../data")
list(PATH.iterdir())

[PosixPath('../data/test_combined.csv'),
 PosixPath('../data/domain_130k_features.csv'),
 PosixPath('../data/domain_130k_features_dns.csv'),
 PosixPath('../data/.DS_Store'),
 PosixPath('../data/domain_130k_all.csv'),
 PosixPath('../data/domain_130k_dns.csv'),
 PosixPath('../data/test_sample.csv'),
 PosixPath('../data/domain_40k_features.csv'),
 PosixPath('../data/README.md'),
 PosixPath('../data/domain_46k_features.csv'),
 PosixPath('../data/.ipynb_checkpoints'),
 PosixPath('../data/test_domain_100.csv')]

In [3]:
df_1 = pd.read_csv(PATH/'domain_40k_features.csv', low_memory=False)
df_2 = pd.read_csv(PATH/'domain_46k_features.csv', low_memory=False)
df_3 = pd.read_csv(PATH/'domain_130k_features.csv', low_memory=False)

In [4]:
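# Treat the string '[]' and NaN as missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
df_1.replace({'[]': None, np.nan: None}, inplace=True)
df_2.replace({'[]': None, np.nan: None}, inplace=True)
df_3.replace({'[]': None, np.nan: None}, inplace=True)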

## Remove all the 'pending' labels

In [5]:
df_40k = df_1[df_1['label']!='pending']
df_46k= df_2[df_2['label']!='pending']
df_60k = df_3[df_3['label']!='pending']

# Model 1 - 40k: Unbalanced data

In [6]:
df_40k.label.value_counts()

trusted      37671
untrusted     3428
Name: label, dtype: int64

In [7]:
X, y = df_40k.select_dtypes(exclude='object'), df_40k['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
lr1 = LogisticRegression()
lr1.fit(X_train, y_train)

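# Note: here and in the cells below the arguments are passed as (y_pred, y_true),
# the reverse of scikit-learn's (y_true, y_pred) order. The precision and recall
# columns are therefore swapped and support counts predictions; per-class f1 is unaffected.
print(classification_report(lr1.predict(X_test), y_test))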

              precision    recall  f1-score   support

     trusted       1.00      0.92      0.96      8220
   untrusted       0.00      0.00      0.00         0

   micro avg       0.92      0.92      0.92      8220
   macro avg       0.50      0.46      0.48      8220
weighted avg       1.00      0.92      0.96      8220
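
The effect of argument order on `classification_report` can be checked on a tiny synthetic example. scikit-learn documents the signature as `classification_report(y_true, y_pred)`. The labels below are made up for illustration (a model that predicts `trusted` for every row), so this is only a sketch of the behaviour, not a re-run of the notebook.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 9 trusted and 1 untrusted row; the model predicts 'trusted' every time.
y_true = ["trusted"] * 9 + ["untrusted"]
y_pred = ["trusted"] * 10

# Documented order: true labels first, predictions second.
correct = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
# Reversed order, as used in this notebook's cells.
swapped = classification_report(y_pred, y_true, output_dict=True, zero_division=0)

for name, rep in [("correct", correct), ("swapped", swapped)]:
    t = rep["trusted"]
    print(name, "trusted precision:", t["precision"], "recall:", t["recall"],
          "| untrusted support:", rep["untrusted"]["support"])
```

With the reversed order, precision and recall trade places and the `untrusted` support drops to zero, which matches the pattern in the reports for the unbalanced datasets.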



In [9]:
rf1 = RandomForestClassifier(n_estimators=100)
rf1.fit(X_train, y_train)

print(classification_report(rf1.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       1.00      0.92      0.96      8208
   untrusted       0.00      0.08      0.00        12

   micro avg       0.92      0.92      0.92      8220
   macro avg       0.50      0.50      0.48      8220
weighted avg       1.00      0.92      0.96      8220



# Model 2 - 46k: Balanced data

In [10]:
df_46k.label.value_counts()

untrusted    23329
trusted      23329
Name: label, dtype: int64

In [11]:
X, y = df_46k.select_dtypes(exclude='object'), df_46k['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
lr2 = LogisticRegression()
lr2.fit(X_train, y_train)

print(classification_report(lr2.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.89      0.90      0.89      4603
   untrusted       0.90      0.89      0.89      4729

   micro avg       0.89      0.89      0.89      9332
   macro avg       0.89      0.89      0.89      9332
weighted avg       0.89      0.89      0.89      9332



In [13]:
rf2 = RandomForestClassifier(n_estimators=100)
rf2.fit(X_train, y_train)

print(classification_report(rf2.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.87      0.92      0.90      4422
   untrusted       0.92      0.88      0.90      4910

   micro avg       0.90      0.90      0.90      9332
   macro avg       0.90      0.90      0.90      9332
weighted avg       0.90      0.90      0.90      9332



# Model 3 - 60k: Unbalanced data

In [14]:
df_60k.label.value_counts()

trusted      55976
untrusted     5370
Name: label, dtype: int64

In [15]:
X, y = df_60k.select_dtypes(exclude='object'), df_60k['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
lr3 = LogisticRegression()
lr3.fit(X_train, y_train)

print(classification_report(lr3.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       1.00      0.91      0.95     12270
   untrusted       0.00      0.00      0.00         0

   micro avg       0.91      0.91      0.91     12270
   macro avg       0.50      0.46      0.48     12270
weighted avg       1.00      0.91      0.95     12270



In [17]:
rf3 = RandomForestClassifier(n_estimators=100)
rf3.fit(X_train, y_train)

print(classification_report(rf3.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       1.00      0.91      0.95     12261
   untrusted       0.00      0.00      0.00         9

   micro avg       0.91      0.91      0.91     12270
   macro avg       0.50      0.46      0.48     12270
weighted avg       1.00      0.91      0.95     12270
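
The accuracy of both unbalanced models (0.92 on the 40k data, 0.91 on the 60k data) is close to what a classifier would score by always predicting `trusted`. The arithmetic is sketched here using the label counts printed by `value_counts()` above. It uses full-dataset counts, so it only approximates the class ratio in the 20% test split.

```python
# Label counts copied from the value_counts() outputs in this notebook.
counts = {
    "40k": {"trusted": 37671, "untrusted": 3428},
    "60k": {"trusted": 55976, "untrusted": 5370},
}

baseline = {}
for name, c in counts.items():
    total = c["trusted"] + c["untrusted"]
    # Accuracy of a classifier that labels every domain as 'trusted'.
    baseline[name] = c["trusted"] / total
    print(f"{name}: always-trusted accuracy = {baseline[name]:.3f}")
```

A model that barely beats this baseline while scoring an f1 of 0 on `untrusted` has learned little beyond the class prior, which is why the balanced datasets are used in the sections that follow.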



# Combine the three datasets --> about 149k rows (stored as `df_146k`)

In [18]:
df_146k = pd.concat([df_40k, df_46k, df_60k], axis=0)

In [19]:
df_146k.label.value_counts()

trusted      116976
untrusted     32127
Name: label, dtype: int64
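
If the three files share domains (an assumption; the notebook does not say whether they overlap), the concatenated frame would contain repeated rows. A minimal sketch of how that could be checked, using two hypothetical toy frames and the `domain` column that appears in the `df_46k.head()` output:

```python
import pandas as pd

# Hypothetical toy frames standing in for the real datasets.
a = pd.DataFrame({"domain": ["a.com", "b.com"], "label": ["trusted", "untrusted"]})
b = pd.DataFrame({"domain": ["b.com", "c.org"], "label": ["untrusted", "trusted"]})

combined = pd.concat([a, b], axis=0, ignore_index=True)

# Count rows whose domain already appeared earlier in the combined frame.
n_dupes = int(combined.duplicated(subset="domain").sum())
# Keep the first occurrence of each domain.
deduped = combined.drop_duplicates(subset="domain")

print(n_dupes, len(deduped))
```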

# Model 4: Down-sampling

In [20]:
trusted = df_146k[df_146k['label']=='trusted']
untrusted = df_146k[df_146k['label']=='untrusted']

In [21]:
trusted_downsampled = resample(trusted, 
                               replace=False, # sample without replacement: trusted has far more rows than untrusted
                               n_samples=len(untrusted), 
                               random_state=123)

In [22]:
df_down = pd.concat([trusted_downsampled, untrusted], axis=0)

In [23]:
X, y = df_down.select_dtypes(exclude='object'), df_down['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
lr4 = LogisticRegression()
lr4.fit(X_train, y_train)

print(classification_report(lr4.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.71      0.77      0.74      5858
   untrusted       0.79      0.73      0.76      6993

   micro avg       0.75      0.75      0.75     12851
   macro avg       0.75      0.75      0.75     12851
weighted avg       0.75      0.75      0.75     12851



In [25]:
rf4 = RandomForestClassifier(n_estimators=100)
rf4.fit(X_train, y_train)

print(classification_report(rf4.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.68      0.80      0.74      5478
   untrusted       0.83      0.73      0.77      7373

   micro avg       0.76      0.76      0.76     12851
   macro avg       0.76      0.76      0.76     12851
weighted avg       0.77      0.76      0.76     12851



# Model 5: Up-sampling

In [26]:
untrusted_upsampled = resample(untrusted, 
                               replace=True, # sample with replacement: untrusted has far fewer rows than trusted
                               n_samples=len(trusted), 
                               random_state=123)

In [27]:
df_up = pd.concat([trusted, untrusted_upsampled], axis=0)

In [28]:
X, y = df_up.select_dtypes(exclude='object'), df_up['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [29]:
lr5 = LogisticRegression()
lr5.fit(X_train, y_train)

print(classification_report(lr5.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.72      0.77      0.74     21578
   untrusted       0.79      0.74      0.76     25213

   micro avg       0.75      0.75      0.75     46791
   macro avg       0.75      0.75      0.75     46791
weighted avg       0.76      0.75      0.75     46791



In [30]:
rf5 = RandomForestClassifier(n_estimators=100)
rf5.fit(X_train, y_train)

print(classification_report(rf5.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.70      0.82      0.76     19850
   untrusted       0.85      0.74      0.79     26941

   micro avg       0.78      0.78      0.78     46791
   macro avg       0.78      0.78      0.78     46791
weighted avg       0.79      0.78      0.78     46791



### Model 2 is still the best so far, so next we try to improve it further

# One-hot encode the 30 most frequent TLDs (plus an `other` category)

In [31]:
tld_cat = df_46k.tld.value_counts()[:30].keys().tolist()
tld_cat.append('other')

In [32]:
len(tld_cat)

31

In [33]:
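df_46k = df_46k.copy()  # df_46k is a filtered view of df_2; copy before adding columns
df_46k['tld_type'] = pd.Categorical(df_46k['tld'], categories=tld_cat)
# TLDs outside the 30 most frequent become NaN above; map them to 'other'
df_46k['tld_type'] = df_46k['tld_type'].fillna('other')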

In [34]:
tld_enc = pd.get_dummies(df_46k['tld_type'])

In [35]:
df_46k = pd.concat([df_46k, tld_enc], axis=1)
df_46k.drop(['tld_type'], axis=1, inplace=True)
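
The encoding steps in cells 31 to 35 can be traced on a small example. This is a sketch on a made-up `tlds` series that keeps only the top 2 TLDs instead of the top 30, so the resulting columns are easy to inspect.

```python
import pandas as pd

# Made-up TLD column: 'com' and 'org' are frequent, 'xyz' and 'io' are rare.
tlds = pd.Series(["com", "org", "com", "xyz", "org", "com", "io"])

# Keep the most frequent categories and add a catch-all bucket.
top = tlds.value_counts()[:2].keys().tolist()
top.append("other")

# Values outside `top` become NaN in the categorical, then fall into 'other'.
tld_type = pd.Series(pd.Categorical(tlds, categories=top)).fillna("other")

# One indicator column per category, in category order.
enc = pd.get_dummies(tld_type)
print(enc.astype(int))
```

Each row has exactly one indicator set, and the rare TLDs share the `other` column, which keeps the number of added features fixed at the length of the category list.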

In [36]:
df_46k.head()

Unnamed: 0,domain,label,app_list,security_trails,whois_counts,company_name,company_name_counts,host_provider,host_provider_counts,mail_provider,...,bid,ca,it,cn,co,cl,eu,fr,biz,other
0,walmart.com,trusted,"[[['Advertising Networks'], 'AppNexus'], [['An...",[{'whois': {'registrar': 'CSC Corporate Domain...,100,{'wal-mart stores'},1,"{'China Telecom (Group)', 'Rackspace Hosting',...",21,"{'Rackspace Hosting', 'MessageLabs Inc.', 'Pro...",...,0,0,0,0,0,0,0,0,0,0
1,exxonmobil.com,trusted,"[[['JavaScript Frameworks'], 'AngularJS'], [['...","[{'whois': {'registrar': 'DreamHost, LLC', 'ex...",100,"{'esso norge as', 'exxon mobil corporation', '...",3,"{'AT&T Services, Inc.', 'SherWeb inc.', 'Racks...",17,"{'COGECODATA', 'U.S. BANCORP', 'SherWeb inc.',...",...,0,0,0,0,0,0,0,0,0,0
2,chevron.com,trusted,"[[['Programming Languages'], 'Java'], [['Analy...",[{'whois': {'registrar': 'CSC Corporate Domain...,100,"{'chevron corporation', 'chevron corp.', 'nors...",3,"{'ExactTarget, Inc.', 'Corporation Service Com...",12,"{'GoDaddy.com, LLC', 'ExactTarget, Inc.', 'Tex...",...,0,0,0,0,0,0,0,0,0,0
3,berkshirehathaway.com,trusted,"[[['Web Servers'], 'Apache'], [['Editors'], 'M...",,0,,0,,0,,...,0,0,0,0,0,0,0,0,0,0
4,apple.com,trusted,"[[['Web Servers'], 'Nginx'], [['Analytics'], '...",,0,,0,,0,,...,0,0,0,0,0,0,0,0,0,0


# Model 6: One-hot encoding

In [37]:
X, y = df_46k.select_dtypes(exclude='object'), df_46k['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [38]:
lr6 = LogisticRegression()
lr6.fit(X_train, y_train)

print(classification_report(lr6.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.89      0.94      0.91      4420
   untrusted       0.94      0.89      0.92      4912

   micro avg       0.92      0.92      0.92      9332
   macro avg       0.92      0.92      0.92      9332
weighted avg       0.92      0.92      0.92      9332



In [39]:
rf6 = RandomForestClassifier(n_estimators=100)
rf6.fit(X_train, y_train)

print(classification_report(rf6.predict(X_test), y_test))

              precision    recall  f1-score   support

     trusted       0.89      0.93      0.91      4473
   untrusted       0.94      0.90      0.92      4859

   micro avg       0.92      0.92      0.92      9332
   macro avg       0.92      0.92      0.92      9332
weighted avg       0.92      0.92      0.92      9332



In [40]:
importances = rf6.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf6.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for i in range(20):
    print(f"No.{i + 1} feature: {X.columns[indices[i]]} ({round(importances[indices[i]], 3)})")

Feature ranking:
No.1 feature: web_tech_counts (0.259)
No.2 feature: category_list_counts (0.226)
No.3 feature: app_list_exist (0.137)
No.4 feature: com (0.122)
No.5 feature: twitter (0.054)
No.6 feature: linkedin (0.039)
No.7 feature: facebook (0.03)
No.8 feature: other (0.019)
No.9 feature: youtube (0.018)
No.10 feature: whois_counts (0.016)
No.11 feature: host_provider_counts (0.012)
No.12 feature: company_name_counts (0.012)
No.13 feature: mail_provider_counts (0.012)
No.14 feature: instagram (0.011)
No.15 feature: registrar_counts (0.008)
No.16 feature: net (0.007)
No.17 feature: com.br (0.003)
No.18 feature: com.au (0.002)
No.19 feature: org (0.002)
No.20 feature: ru (0.001)
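
The cell above computes `std`, the spread of each importance across the trees, but does not print it. A minimal sketch of one way to show both values side by side follows. It uses a synthetic dataset from `make_classification` and hypothetical feature names `f0` to `f4` rather than the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the notebook's feature matrix.
X_arr, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = rf.feature_importances_
# Standard deviation of each feature's importance across the individual trees.
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

ranking = (
    pd.DataFrame({"feature": X.columns, "importance": importances, "std": std})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(ranking.round(3))
```

A large `std` relative to the importance suggests the ranking of that feature varies a lot between trees and should be read with caution.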


# 92% f1-score achieved without any DNS data

# Save our best models
- `lr6`
- `rf6`

In [41]:
model_path = Path("../model/saved_models")

lr_path = model_path/'best_LR_model.sav'
rf_path = model_path/'best_RF_model.sav'

In [42]:
with open(lr_path, 'wb') as f:
    pickle.dump(lr6, f)
with open(rf_path, 'wb') as f:
    pickle.dump(rf6, f)
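
Loading a saved model back is the mirror image of the dump. The sketch below round-trips a toy `LogisticRegression` through a temporary file. The toy data and the temporary path are assumptions for illustration; in this project the file would be `best_LR_model.sav` or `best_RF_model.sav` under `model_path`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression

# Toy model standing in for lr6.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
model = LogisticRegression().fit(X_toy, y_toy)

# Temporary location used only for this example.
tmp_path = Path(tempfile.mkdtemp()) / "toy_LR_model.sav"

with open(tmp_path, "wb") as f:
    pickle.dump(model, f)

# Only unpickle files from a source you trust.
with open(tmp_path, "rb") as f:
    loaded = pickle.load(f)

same = (loaded.predict(X_toy) == model.predict(X_toy)).all()
print(same)
```

Pickled scikit-learn models are best loaded with the same scikit-learn version that saved them.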

## Save the list of the 30 most frequent TLD categories (plus `other`) as `.txt`

In [43]:
save_tld_cat_path = model_path/'tld_cat.txt'

In [44]:
with open(save_tld_cat_path, 'w') as f:
    f.write(",".join(tld_cat))
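
The saved list matters at prediction time, because new data has to be encoded with the same columns, in the same order, as the training data. A minimal sketch of reading the file back and applying it follows. The three-entry list, the temporary path, and the `new` frame are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Short illustrative list; the real file holds 31 comma-separated entries.
tld_cat = ["com", "org", "other"]
path = Path(tempfile.mkdtemp()) / "tld_cat.txt"
path.write_text(",".join(tld_cat))

# Read the categories back in their saved order.
loaded = path.read_text().split(",")

# Hypothetical new domains to score; 'dev' is not in the saved list.
new = pd.DataFrame({"tld": ["com", "dev", "org"]})
tld_type = pd.Series(pd.Categorical(new["tld"], categories=loaded),
                     index=new.index).fillna("other")

# Fixing the categories guarantees a column for every saved TLD,
# including TLDs that do not occur in the new data.
encoded = pd.get_dummies(tld_type)
print(encoded.astype(int))
```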