# Malicious Domain Dataset Preprocessing

In [60]:
import pandas as pd

datasource = "dataset.csv"
data = pd.read_csv(datasource)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90000 entries, 0 to 89999
Data columns (total 34 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Domain               90000 non-null  int64  
 1   DNSRecordType        90000 non-null  object 
 2   MXDnsResponse        90000 non-null  bool   
 3   TXTDnsResponse       90000 non-null  bool   
 4   HasSPFInfo           90000 non-null  bool   
 5   HasDkimInfo          90000 non-null  bool   
 6   HasDmarcInfo         90000 non-null  bool   
 7   Ip                   90000 non-null  int64  
 8   DomainInAlexaDB      90000 non-null  bool   
 9   CommonPorts          90000 non-null  bool   
 10  CountryCode          60948 non-null  object 
 11  RegisteredCountry    12226 non-null  object 
 12  CreationDate         90000 non-null  int64  
 13  LastUpdateDate       90000 non-null  int64  
 14  ASN                  90000 non-null  int64  
 15  HttpResponseCode     90000 non-null 

In [61]:
data.describe()

|  | Domain | Ip | CreationDate | LastUpdateDate | ASN | HttpResponseCode | SubdomainNumber | Entropy | EntropyOfSubDomains | StrangeCharacters | ConsoantRatio | NumericRatio | SpecialCharRatio | VowelRatio | ConsoantSequence | VowelSequence | NumericSequence | SpecialCharSequence | DomainLength | Class |
| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| count | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 | 90000.0 |
| mean | 44999.5 | 13479.648033 | 1.933611 | 2.365744 | 23335.808167 | 0.667033 | 103.0692 | 2.866844 | 0.003178 | 3.498011 | 0.459519 | 0.144281 | 0.006526 | 0.261528 | 2.719222 | 1.342756 | 1.516478 | 0.112378 | 26.440422 | 0.5 |
| std | 25980.906451 | 4160.26641 | 1.997232 | 1.935509 | 37004.865724 | 1.203285 | 4243.802846 | 0.488291 | 0.081042 | 4.471591 | 0.146031 | 0.147331 | 0.026162 | 0.0986 | 1.699339 | 0.554527 | 1.538932 | 0.431967 | 22.341135 | 0.500003 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 22499.75 | 11709.75 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 | 0.2 | 2.0 | 1.0 | 0.0 | 0.0 | 15.0 | 0.0 |
| 50% | 44999.5 | 14626.0 | 0.0 | 4.0 | 26228.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.5 | 0.1 | 0.0 | 0.2 | 2.0 | 1.0 | 1.0 | 0.0 | 24.0 | 0.5 |
| 75% | 67499.25 | 16984.0 | 4.0 | 4.0 | 26228.0 | 2.0 | 57.0 | 3.0 | 0.0 | 7.0 | 0.6 | 0.3 | 0.0 | 0.3 | 3.0 | 2.0 | 3.0 | 0.0 | 31.0 | 1.0 |
| max | 89999.0 | 16984.0 | 4.0 | 4.0 | 398108.0 | 5.0 | 661909.0 | 5.0 | 3.0 | 124.0 | 1.0 | 0.8 | 0.9 | 0.8 | 37.0 | 7.0 | 45.0 | 61.0 | 153.0 | 1.0 |


## Feature Cleaning and Selection

### Domain

|  Type   | Default Value |
| :-----: | :-----------: |
| Integer | N/A           |

The anonymized domain name. Each unique domain name is mapped to a unique integer (e.g., `google.com -> 1`).
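
The mapping described above can be illustrated with `pd.factorize`. This is only a sketch of one way to produce such integer IDs; the dataset's actual anonymization procedure is not documented here, and the domain names below are hypothetical. Note that `factorize` starts numbering at 0, whereas the example above starts at 1.

```python
import pandas as pd

# Hypothetical raw domain names; the real dataset ships only the integer IDs.
raw = pd.Series(["google.com", "example.org", "google.com", "evil.test"])

# factorize assigns each distinct value an integer code in order of first appearance.
codes, uniques = pd.factorize(raw)

print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['google.com', 'example.org', 'evil.test']
```

Repeated names receive the same code, so the number of distinct codes equals the number of distinct domains.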

In [62]:
# Are there any duplicate domain values?
print("Duplicate values: %d" % data["Domain"].duplicated().sum())

# Are there any null values?
print("Null values: %d" % data["Domain"].isnull().sum())

Duplicate values: 0
Null values: 0


The domain is the base feature of this dataset: every other feature is derived from the domain name. Because each value is unique, the anonymized integer acts only as a row identifier, so we do not use it in a model and remove it from the dataset.

In [63]:
data = data.drop(columns=["Domain"])

### DNSRecordType

|  Type   | Default Value |
| :-----: | :-----------: |
| Text    | N/A           |

The DNS record type. Types are one of the following:

|  Type   | Description |
| :-----: | :---------: |
| A       | IPv4 Record |
| AAAA    | IPv6 Record |
| CNAME   | Canonical Name Record |
| MX      | Mail Exchange Record  |

In [64]:
# Get class breakdown of domains based on this feature
data.groupby(["DNSRecordType", "Class"]).agg({"Class": "count"})

| DNSRecordType | Class | Count |
| :-----: | :-----: | :-----: |
| A       | 0 | 45000 |
| A       | 1 | 4529  |
| CNAME   | 1 | 35997 |
| MX      | 1 | 4474  |


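The record types are not well mixed across classes. Every CNAME and MX record is malicious (class 1), all 45,000 non-malicious (class 0) domains are A records, and no AAAA records appear at all. A model could use the record type as a near-direct proxy for the label, so this feature must be discarded.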
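
The check behind this decision can be sketched as follows. The counts are copied from the grouped output above; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the grouped output above.
counts = pd.DataFrame(
    {
        "DNSRecordType": ["A", "A", "CNAME", "MX"],
        "Class": [0, 1, 1, 1],
        "Count": [45000, 4529, 35997, 4474],
    }
)

# One row per record type, one column per class; missing pairs become 0.
table = counts.pivot_table(
    index="DNSRecordType", columns="Class", values="Count", aggfunc="sum", fill_value=0
)

# A record type is single-class when one of its class counts is zero.
single_class = table.index[(table == 0).any(axis=1)].tolist()
print(single_class)  # ['CNAME', 'MX']
```

On the full DataFrame, `pd.crosstab(data["DNSRecordType"], data["Class"])` should produce the same kind of table directly.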

In [57]:
data = data.drop(columns=["DNSRecordType"])

### MXDnsResponse

|  Type   | Default Value |
| :-----: | :-----------: |
| Boolean | False         |

Whether or not a request for MX info returns information.

In [58]:
# Get class breakdown of domains based on this feature
data.groupby(["MXDnsResponse", "Class"]).agg({"Class": "count"})

| MXDnsResponse | Class | Count |
| :-----: | :-----: | :-----: |
| False   | 0 | 30400 |
| False   | 1 | 37057 |
| True    | 0 | 14600 |
| True    | 1 | 7943  |
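
Raw counts are easier to compare once they are converted to per-value class shares. The sketch below does this for the MXDnsResponse counts shown above; the same pattern should apply to the other boolean features. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the MXDnsResponse breakdown above.
counts = pd.DataFrame(
    {
        "MXDnsResponse": [False, False, True, True],
        "Class": [0, 1, 0, 1],
        "Count": [30400, 37057, 14600, 7943],
    }
)

# Rows are feature values (False, True); columns are classes (0, 1).
table = counts.pivot(index="MXDnsResponse", columns="Class", values="Count")

# Divide each row by its total to get the class share per feature value.
shares = table.div(table.sum(axis=1), axis=0)
print(shares.round(3))
```

With these counts, roughly 55% of domains without an MX response are class 1, compared with roughly 35% of domains with one.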


### TXTDnsResponse

|  Type   | Default Value |
| :-----: | :-----------: |
| Boolean | False         |

Whether or not a request for TXT info returns information.

In [52]:
# Get class breakdown of domains based on this feature
data.groupby(["TXTDnsResponse", "Class"]).agg({"Class": "count"})

| TXTDnsResponse | Class | Count |
| :-----: | :-----: | :-----: |
| False   | 0 | 34388 |
| False   | 1 | 10171 |
| True    | 0 | 10612 |
| True    | 1 | 34829 |


### HasSPFInfo

|  Type   | Default Value |
| :-----: | :-----------: |
| Boolean | False         |

Whether the DNS record has Sender Policy Framework (SPF) information. SPF helps validate that a mail server is authorized to send messages on behalf of a domain.

In [66]:
# Get class breakdown of domains based on this feature
data.groupby(["HasSPFInfo", "Class"]).agg({"Class": "count"})

| HasSPFInfo | Class | Count |
| :-----: | :-----: | :-----: |
| False   | 0 | 35218 |
| False   | 1 | 10561 |
| True    | 0 | 9782  |
| True    | 1 | 34439 |
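
For context, SPF policies are published as DNS TXT records that begin with the version tag `v=spf1` (RFC 7208). The sketch below shows how a flag like HasSPFInfo could be derived from a domain's TXT record strings. How the dataset authors actually computed the feature is not stated, so the function name and the sample records are assumptions for illustration only.

```python
def has_spf_record(txt_records):
    """Return True if any TXT record string is an SPF version 1 record."""
    for record in txt_records:
        # An SPF record starts with the version tag "v=spf1", followed by
        # a space or the end of the record.
        normalized = record.strip().lower()
        if normalized == "v=spf1" or normalized.startswith("v=spf1 "):
            return True
    return False


# Hypothetical TXT records for illustration.
print(has_spf_record(["v=spf1 include:_spf.example.com ~all"]))  # True
print(has_spf_record(["google-site-verification=abc123"]))       # False
```

The function works on already-fetched record strings, so it performs no DNS lookups itself.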
