The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that malware was detected on the machine. Using the information and labels in train.csv, you must predict the value of HasDetections for each machine in test.csv.

The sampling methodology used to create this dataset was designed to meet certain business constraints, with regard to both user privacy and the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is complicated by the introduction of new machines, machines that come online and go offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above mean that you may see imperfect agreement between your cross-validation, public, and private scores. Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.

In [222]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Experimental opt-in needed on scikit-learn < 1.0 to expose HistGradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting
import sklearn.ensemble as ske
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [223]:
# set up display area to show dataframe in jupyter qtconsole
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [224]:
# Specify the data types explicitly when reading the CSV. Otherwise pandas has to infer them,
# which consumes a lot of memory and triggers the warning
# "Specify dtype option on import or set low_memory=False".
#
# These data types were obtained by loading a sample of the data and exporting train_data.dtypes.

datatypes = {
    'ProductName': np.int64,
    'IsBeta': np.int64,
    'RtpStateBitfield': np.float64,
    'IsSxsPassiveMode': np.int64,
    'AVProductStatesIdentifier': np.float64,
    'AVProductsInstalled': np.float64,
    'AVProductsEnabled': np.float64,
    'HasTpm': np.int64,
    'CountryIdentifier': np.int64,
    'CityIdentifier': np.int64,
    'OrganizationIdentifier': np.int64,
    'GeoNameIdentifier': np.float64,
    'LocaleEnglishNameIdentifier': np.int64,
    'Platform': np.int64,
    'Processor': np.int64,
    'OsBuild': np.int64,
    'OsSuite': np.int64,
    'OsPlatformSubRelease': np.int64,
    'SkuEdition': np.int64,
    'IsProtected': np.float64,
    'AutoSampleOptIn': np.int64,
    'SMode': np.int64,
    'IeVerIdentifier': np.float64,
    'SmartScreen': np.int64,
    'Firewall': np.float64,
    'UacLuaenable': np.float64,
    'Census_MDC2FormFactor': np.int64,
    'Census_DeviceFamily': np.int64,
    'Census_OEMNameIdentifier': np.float64,
    'Census_OEMModelIdentifier': np.float64,
    'Census_ProcessorCoreCount': np.float64,
    'Census_ProcessorManufacturerIdentifier': np.float64,
    'Census_ProcessorModelIdentifier': np.float64,
    'Census_PrimaryDiskTotalCapacity': np.float64,
    'Census_PrimaryDiskTypeName': np.int64,
    'Census_SystemVolumeTotalCapacity': np.float64,
    'Census_HasOpticalDiskDrive': np.int64,
    'Census_TotalPhysicalRAM': np.float64,
    'Census_ChassisTypeName': np.int64,
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': np.float64,
    'Census_InternalPrimaryDisplayResolutionHorizontal': np.float64,
    'Census_InternalPrimaryDisplayResolutionVertical': np.float64,
    'Census_PowerPlatformRoleName': np.int64,
    'Census_InternalBatteryNumberOfCharges': np.float64,
    'Census_OSArchitecture': np.int64,
    'Census_OSBranch': np.int64,
    'Census_OSBuildNumber': np.int64,
    'Census_OSBuildRevision': np.int64,
    'Census_OSEdition': np.int64,
    'Census_OSSkuName': np.int64,
    'Census_OSInstallTypeName': np.int64,
    'Census_OSInstallLanguageIdentifier': np.float64,
    'Census_OSUILocaleIdentifier': np.int64,
    'Census_OSWUAutoUpdateOptionsName': np.int64,
    'Census_IsPortableOperatingSystem': np.int64,
    'Census_GenuineStateName': np.int64,
    'Census_ActivationChannel': np.int64,
    'Census_IsFlightsDisabled': np.float64,
    'Census_FlightRing': np.int64,
    'Census_ThresholdOptIn': np.float64,
    'Census_FirmwareManufacturerIdentifier': np.float64,
    'Census_FirmwareVersionIdentifier': np.float64,
    'Census_IsSecureBootEnabled': np.int64,
    'Census_IsWIMBootEnabled': np.float64,
    'Census_IsVirtualDevice': np.float64,
    'Census_IsTouchEnabled': np.int64,
    'Census_IsPenCapable': np.int64,
    'Census_IsAlwaysOnAlwaysConnectedCapable': np.float64,
    'Wdft_IsGamer': np.int64,
    'Wdft_RegionIdentifier': np.int64,
    'HasDetections': np.int64,
    'EngineVersion_1': np.int64,
    'EngineVersion_2': np.int64,
    'EngineVersion_3': np.int64,
    'EngineVersion_4': np.int64,
    'AppVersion_1': np.int64,
    'AppVersion_2': np.int64,
    'AppVersion_3': np.int64,
    'AppVersion_4': np.int64,
    'AvSigVersion_1': np.int64,
    'AvSigVersion_2': np.float64,
    'AvSigVersion_3': np.int64,
    'AvSigVersion_4': np.int64,
    'OsVer_1': np.int64,
    'OsVer_2': np.int64,
    'OsVer_3': np.int64,
    'OsVer_4': np.int64,
    'OsBuildLab_1': np.float64,
    'OsBuildLab_2': np.float64,
    'OsBuildLab_3': np.int64,
    'OsBuildLab_4': np.int64,
    'OsBuildLab_5': np.float64,
    'OsBuildLab_6': np.float64,
    'Census_OSVersion_1': np.int64,
    'Census_OSVersion_2': np.int64,
    'Census_OSVersion_3': np.int64,
    'Census_OSVersion_4': np.int64
}

full_features = pd.read_csv("./csv/train_v6.csv", dtype=datatypes, index_col="MachineIdentifier")
#full_features = pd.read_csv("./csv/train.csv", dtype=datatypes, nrows=200000, index_col="MachineIdentifier")
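The effect of an explicit `dtype` mapping can be checked on a small scale. The sketch below is an illustration only: the CSV text and its values are made up, and the narrower types (`np.int8`, `np.float32`) are an assumption that goes beyond the `int64`/`float64` mapping used above. Columns that contain missing values have to stay floating point, which is presumably why so many identifier columns above are `float64`.

```python
import io
import numpy as np
import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical values).
csv_text = (
    "MachineIdentifier,IsBeta,Firewall,HasDetections\n"
    "a1,0,1.0,1\n"
    "b2,0,,0\n"
    "c3,1,1.0,1\n"
)

# Assumed narrower types; Firewall has a missing value, so it must remain a float.
small_dtypes = {"IsBeta": np.int8, "Firewall": np.float32, "HasDetections": np.int8}

# Same data read twice: once with inferred dtypes, once with the explicit mapping.
wide = pd.read_csv(io.StringIO(csv_text), index_col="MachineIdentifier")
narrow = pd.read_csv(io.StringIO(csv_text), dtype=small_dtypes, index_col="MachineIdentifier")

# Bytes used by the data columns only (index excluded).
wide_bytes = int(wide.memory_usage(index=False).sum())
narrow_bytes = int(narrow.memory_usage(index=False).sum())

print(wide.dtypes.to_dict())
print(narrow.dtypes.to_dict())
print(wide_bytes, narrow_bytes)
```

Whether narrower types are safe for a given column depends on its value range, so the ranges should be checked on the real data before applying such a mapping.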

In [225]:
full_labels = full_features["HasDetections"]

# Dropping labels ["HasDetections"] from training dataset
full_features = full_features.drop(["HasDetections"], axis=1)

In [226]:
print (full_features.shape)

(8921483, 96)


In [227]:
train_count = int(len(full_features) * 0.8)

train_features = full_features.values[:train_count]
train_labels = full_labels.values[:train_count]

test_features = full_features.values[train_count:]
test_labels = full_labels.values[train_count:]

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(train_features, train_labels)
all_columns_score = clf.score(test_features, test_labels)
    
print ("All columns (original)", train_features.shape, "HistGradientBoostingClassifier", all_columns_score*100)


All columns (original) (7137186, 96) HistGradientBoostingClassifier 64.2393054519511
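The accuracy printed above is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. The helper name `majority_baseline_accuracy` and the toy label arrays are hypothetical; with the notebook's data it would be called on `train_labels` and `test_labels`.

```python
import numpy as np

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(test_labels == majority))

# Hypothetical labels, not the competition data.
toy_train = np.array([0, 1, 1, 0, 1, 1])
toy_test = np.array([1, 0, 1, 1])

print(majority_baseline_accuracy(toy_train, toy_test) * 100)
```

If the classifier's score is only slightly above this baseline, most of its accuracy comes from the class balance rather than from the features.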


In [228]:
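# Engineered feature: system volume as a percentage of primary disk capacity (inf/NaN where the disk capacity is 0)
full_features['Census_CapacityHyperparameter1'] = full_features['Census_SystemVolumeTotalCapacity'] / full_features['Census_PrimaryDiskTotalCapacity'] * 100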
#full_features['Census_CapacityHyperparameter1'].head(10)
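Ratio features like this one are sensitive to zero denominators: in pandas, a float division by zero yields `inf` (or `NaN` for 0/0) rather than raising. A minimal sketch of a guarded version follows; `safe_ratio` is a hypothetical helper and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio of two Series; +/-inf results (x/0) are mapped to NaN."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan)

# Toy rows: a normal machine, a zero-capacity disk, and a 0/0 case.
toy = pd.DataFrame({
    "Census_SystemVolumeTotalCapacity": [100.0, 50.0, 0.0],
    "Census_PrimaryDiskTotalCapacity": [200.0, 0.0, 0.0],
})

toy["Census_CapacityHyperparameter1"] = safe_ratio(
    toy["Census_SystemVolumeTotalCapacity"],
    toy["Census_PrimaryDiskTotalCapacity"],
) * 100

print(toy["Census_CapacityHyperparameter1"].tolist())
```

Mapping the undefined ratios to NaN treats them as missing values, which is one reasonable choice; filling them with a sentinel or a column statistic would be another.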

In [229]:
#full_features['Census_CapacityHyperparameter2'] = full_features['Census_PrimaryDiskTotalCapacity'] - full_features['Census_SystemVolumeTotalCapacity'] 
#full_features['Census_CapacityHyperparameter2'].head(10)

In [230]:
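# Engineered feature: display aspect ratio, horizontal / vertical resolution (inf/NaN where the vertical resolution is 0)
full_features['Census_RezolutionHyperparameter1'] = full_features['Census_InternalPrimaryDisplayResolutionHorizontal'] / full_features['Census_InternalPrimaryDisplayResolutionVertical']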
#full_features['Census_RezolutionHyperparameter1'].head(10)

In [231]:
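# Engineered feature: horizontal pixels per inch of diagonal display size (inf/NaN where the diagonal size is 0)
full_features['Census_RezolutionHyperparameter2'] = full_features['Census_InternalPrimaryDisplayResolutionHorizontal'] / full_features['Census_InternalPrimaryDiagonalDisplaySizeInInches']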
full_features['Census_RezolutionHyperparameter2'].head(10)

MachineIdentifier
0000028988387b115f69f31a3bf04f09     76.190476
000007535c3f730efa9ea0b7ef1bd645     98.273381
000007905a28d863f6d0d597892cd692     89.302326
00000b11598a75ea8ba1beea8459149f     73.837838
000014a5f00daa18e76b81417eeb99fc     97.571429
000016191b897145d069102325cab760     89.302326
0000161e8abf8d8b89c5ab8787fd712b     93.023256
000019515bc8f95851aff6de873405e8     88.129032
00001a027a0ab970c408182df8484fce    123.076923
00001a18d69bb60bda9779408dcf02ac     88.129032
Name: Census_RezolutionHyperparameter2, dtype: float64

In [232]:
#full_features['Census_RezolutionHyperparameter3'] = full_features['Census_InternalPrimaryDisplayResolutionVertical'] / full_features['Census_InternalPrimaryDiagonalDisplaySizeInInches'] 
#full_features['Census_RezolutionHyperparameter3'].head(10)

In [233]:
print (full_features.shape)

(8921483, 99)


In [234]:
train_features = full_features.values[:train_count]
train_labels = full_labels.values[:train_count]

test_features = full_features.values[train_count:]
test_labels = full_labels.values[train_count:]

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(train_features, train_labels)
all_columns_score = clf.score(test_features, test_labels)
    
print ("All columns (hyperparameter)", train_features.shape, "HistGradientBoostingClassifier", all_columns_score*100)


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
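A likely cause of this error is the three ratio columns added since the first fit: any row with a zero denominator produces `inf`, or `NaN` for 0/0. This is an inference from the code above, not something the traceback confirms. The sketch below shows one way to locate the affected columns before refitting; `non_finite_report` is a hypothetical helper and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def non_finite_report(frame):
    """Count NaN and +/-inf entries per numeric column; return only affected columns."""
    numeric = frame.select_dtypes(include=[np.number])
    values = numeric.to_numpy(dtype="float64")
    report = pd.DataFrame(
        {
            "nan": np.isnan(values).sum(axis=0),
            "inf": np.isinf(values).sum(axis=0),
        },
        index=numeric.columns,
    )
    return report[(report["nan"] > 0) | (report["inf"] > 0)]

# Toy frame: column "a" has one inf and one NaN, "b" is clean, "c" has two NaN.
toy = pd.DataFrame({
    "a": [1.0, np.inf, np.nan],
    "b": [1, 2, 3],
    "c": [np.nan, np.nan, 0.0],
})

report = non_finite_report(toy)
print(report)

# One possible repair: treat infinities as missing values.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
```

Whether the remaining NaN values are accepted depends on the installed scikit-learn version of `HistGradientBoostingClassifier`; if they are not, they would also need to be imputed or the affected rows dropped before calling `fit`.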