The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates whether malware was detected on the machine. Using the information and labels in train.csv, you must predict the value of HasDetections for each machine in test.csv.

The sampling methodology used to create this dataset was designed to meet certain business constraints, regarding both user privacy and the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is complicated by the introduction of new machines, machines that come online and go offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above mean you may see imperfect agreement between your cross-validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Needed only on scikit-learn < 1.0, where HistGradientBoostingClassifier was still experimental
from sklearn.experimental import enable_hist_gradient_boosting
import sklearn.ensemble as ske
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [2]:
# set up display area to show dataframe in jupyter qtconsole
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

Columns

Unavailable or self-documenting column names are marked with an "NA".

    MachineIdentifier - Individual machine ID
    ProductName - Defender state information e.g. win8defender
    EngineVersion - Defender state information e.g. 1.1.12603.0
    AppVersion - Defender state information e.g. 4.9.10586.0
    AvSigVersion - Defender state information e.g. 1.217.1014.0
    IsBeta - Defender state information e.g. false
    RtpStateBitfield - NA
    IsSxsPassiveMode - NA
    DefaultBrowsersIdentifier - ID for the machine's default browser
    AVProductStatesIdentifier - ID for the specific configuration of a user's antivirus software
    AVProductsInstalled - NA
    AVProductsEnabled - NA
    HasTpm - True if machine has tpm
    CountryIdentifier - ID for the country the machine is located in
    CityIdentifier - ID for the city the machine is located in
    OrganizationIdentifier - ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries
    GeoNameIdentifier - ID for the geographic region a machine is located in
    LocaleEnglishNameIdentifier - English name of Locale ID of the current user
    Platform - Calculates platform name (of OS related properties and processor property)
    Processor - This is the processor architecture of the installed operating system
    OsVer - Version of the current operating system
    OsBuild - Build of the current operating system
    OsSuite - Product suite mask for the current operating system.
    OsPlatformSubRelease - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)
    OsBuildLab - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022
    SkuEdition - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Types are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must handle each class.
    IsProtected - This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected.
    AutoSampleOptIn - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+
    PuaMode - Pua Enabled mode from the service
    SMode - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
    IeVerIdentifier - NA
    SmartScreen - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.
    Firewall - This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service.
    UacLuaenable - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.
    Census_MDC2FormFactor - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)
    Census_DeviceFamily - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
    Census_OEMNameIdentifier - NA
    Census_OEMModelIdentifier - NA
    Census_ProcessorCoreCount - Number of logical cores in the processor
    Census_ProcessorManufacturerIdentifier - NA
    Census_ProcessorModelIdentifier - NA
    Census_ProcessorClass - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated
    Census_PrimaryDiskTotalCapacity - Amount of disk space on primary disk of the machine in MB
    Census_PrimaryDiskTypeName - Friendly name of Primary Disk Type - HDD or SSD
    Census_SystemVolumeTotalCapacity - The size of the partition that the System volume is installed on in MB
    Census_HasOpticalDiskDrive - True indicates that the machine has an optical disk drive (CD/DVD)
    Census_TotalPhysicalRAM - Retrieves the physical RAM in MB
    Census_ChassisTypeName - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
    Census_InternalPrimaryDiagonalDisplaySizeInInches - Retrieves the physical diagonal length in inches of the primary display
    Census_InternalPrimaryDisplayResolutionHorizontal - Retrieves the number of pixels in the horizontal direction of the internal display.
    Census_InternalPrimaryDisplayResolutionVertical - Retrieves the number of pixels in the vertical direction of the internal display
    Census_PowerPlatformRoleName - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
    Census_InternalBatteryType - NA
    Census_InternalBatteryNumberOfCharges - NA
    Census_OSVersion - Numeric OS version Example - 10.0.10130.0
    Census_OSArchitecture - Architecture on which the OS is based. Derived from OSVersionFull. Example - amd64
    Census_OSBranch - Branch of the OS extracted from the OsVersionFull. Example - OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005
    Census_OSBuildNumber - OS Build number extracted from the OsVersionFull. Example - OsBuildNumber = 10512 or 10240
    Census_OSBuildRevision - OS Build revision extracted from the OsVersionFull. Example - OsBuildRevision = 1000 or 16458
    Census_OSEdition - Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise
    Census_OSSkuName - OS edition friendly name (currently Windows only)
    Census_OSInstallTypeName - Friendly description of what install was used on the machine i.e. clean
    Census_OSInstallLanguageIdentifier - NA
    Census_OSUILocaleIdentifier - NA
    Census_OSWUAutoUpdateOptionsName - Friendly name of the WindowsUpdate auto-update settings on the machine.
    Census_IsPortableOperatingSystem - Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.
    Census_GenuineStateName - Friendly name of OSGenuineStateID. 0 = Genuine
    Census_ActivationChannel - Retail license key or Volume license key for a machine.
    Census_IsFlightingInternal - NA
    Census_IsFlightsDisabled - Indicates if the machine is participating in flighting.
    Census_FlightRing - The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.
    Census_ThresholdOptIn - NA
    Census_FirmwareManufacturerIdentifier - NA
    Census_FirmwareVersionIdentifier - NA
    Census_IsSecureBootEnabled - Indicates if Secure Boot mode is enabled.
    Census_IsWIMBootEnabled - NA
    Census_IsVirtualDevice - Identifies a Virtual Machine (machine learning model)
    Census_IsTouchEnabled - Is this a touch device?
    Census_IsPenCapable - Is the device capable of pen input?
    Census_IsAlwaysOnAlwaysConnectedCapable - Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.
    Wdft_IsGamer - Indicates whether the device is a gamer device or not based on its hardware combination.
    Wdft_RegionIdentifier - NA
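
The dotted version strings in the list above (for example EngineVersion 1.1.12603.0 and AppVersion 4.9.10586.0) appear later in this notebook as numeric columns such as EngineVersion_1 to EngineVersion_4. The preprocessing that produced those columns is not shown here. The following is only a minimal sketch of one way such a split could be done, assuming every value has exactly four dot-separated integer parts; the helper name split_version is hypothetical.

```python
import pandas as pd


def split_version(df, column, parts=4):
    """Replace a dotted version column with integer columns <column>_1 .. <column>_<parts>.

    Hypothetical helper: assumes every value has exactly `parts` integer pieces.
    """
    pieces = df[column].str.split(".", expand=True).iloc[:, :parts].astype("int64")
    pieces.columns = [f"{column}_{i}" for i in range(1, parts + 1)]
    return pd.concat([df.drop(columns=[column]), pieces], axis=1)


# Tiny illustrative frame; the values are the examples given in the column list.
sample = pd.DataFrame({
    "EngineVersion": ["1.1.12603.0"],
    "AppVersion": ["4.9.10586.0"],
})
for col in ["EngineVersion", "AppVersion"]:
    sample = split_version(sample, col)

print(sample.iloc[0].to_dict())
```

Columns such as OsBuildLab mix numbers with text (for example amd64fre), so they would need an additional encoding step that this sketch does not cover.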


In [7]:
# We need to specify data types explicitly when reading the CSV; otherwise loading uses a lot of memory
# and pandas warns: "Specify dtype option on import or set low_memory=False".
# So we define the data types manually.

# P.S. I loaded a sample of the data and exported train_data.dtypes;
# these are the data types used for fast loading.

datatypes = {
    'ProductName': np.int64,
    'IsBeta': np.int64,
    'RtpStateBitfield': np.float64,
    'IsSxsPassiveMode': np.int64,
    'AVProductStatesIdentifier': np.float64,
    'AVProductsInstalled': np.float64,
    'AVProductsEnabled': np.float64,
    'HasTpm': np.int64,
    'CountryIdentifier': np.int64,
    'CityIdentifier': np.int64,
    'OrganizationIdentifier': np.int64,
    'GeoNameIdentifier': np.float64,
    'LocaleEnglishNameIdentifier': np.int64,
    'Platform': np.int64,
    'Processor': np.int64,
    'OsBuild': np.int64,
    'OsSuite': np.int64,
    'OsPlatformSubRelease': np.int64,
    'SkuEdition': np.int64,
    'IsProtected': np.float64,
    'AutoSampleOptIn': np.int64,
    'SMode': np.int64,
    'IeVerIdentifier': np.float64,
    'SmartScreen': np.int64,
    'Firewall': np.float64,
    'UacLuaenable': np.float64,
    'Census_MDC2FormFactor': np.int64,
    'Census_DeviceFamily': np.int64,
    'Census_OEMNameIdentifier': np.float64,
    'Census_OEMModelIdentifier': np.float64,
    'Census_ProcessorCoreCount': np.float64,
    'Census_ProcessorManufacturerIdentifier': np.float64,
    'Census_ProcessorModelIdentifier': np.float64,
    'Census_PrimaryDiskTotalCapacity': np.float64,
    'Census_PrimaryDiskTypeName': np.int64,
    'Census_SystemVolumeTotalCapacity': np.float64,
    'Census_HasOpticalDiskDrive': np.int64,
    'Census_TotalPhysicalRAM': np.float64,
    'Census_ChassisTypeName': np.int64,
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': np.float64,
    'Census_InternalPrimaryDisplayResolutionHorizontal': np.float64,
    'Census_InternalPrimaryDisplayResolutionVertical': np.float64,
    'Census_PowerPlatformRoleName': np.int64,
    'Census_InternalBatteryNumberOfCharges': np.float64,
    'Census_OSArchitecture': np.int64,
    'Census_OSBranch': np.int64,
    'Census_OSBuildNumber': np.int64,
    'Census_OSBuildRevision': np.int64,
    'Census_OSEdition': np.int64,
    'Census_OSSkuName': np.int64,
    'Census_OSInstallTypeName': np.int64,
    'Census_OSInstallLanguageIdentifier': np.float64,
    'Census_OSUILocaleIdentifier': np.int64,
    'Census_OSWUAutoUpdateOptionsName': np.int64,
    'Census_IsPortableOperatingSystem': np.int64,
    'Census_GenuineStateName': np.int64,
    'Census_ActivationChannel': np.int64,
    'Census_IsFlightsDisabled': np.float64,
    'Census_FlightRing': np.int64,
    'Census_ThresholdOptIn': np.float64,
    'Census_FirmwareManufacturerIdentifier': np.float64,
    'Census_FirmwareVersionIdentifier': np.float64,
    'Census_IsSecureBootEnabled': np.int64,
    'Census_IsWIMBootEnabled': np.float64,
    'Census_IsVirtualDevice': np.float64,
    'Census_IsTouchEnabled': np.int64,
    'Census_IsPenCapable': np.int64,
    'Census_IsAlwaysOnAlwaysConnectedCapable': np.float64,
    'Wdft_IsGamer': np.int64,
    'Wdft_RegionIdentifier': np.int64,
    'HasDetections': np.int64,
    'EngineVersion_1': np.int64,
    'EngineVersion_2': np.int64,
    'EngineVersion_3': np.int64,
    'EngineVersion_4': np.int64,
    'AppVersion_1': np.int64,
    'AppVersion_2': np.int64,
    'AppVersion_3': np.int64,
    'AppVersion_4': np.int64,
    'AvSigVersion_1': np.int64,
    'AvSigVersion_2': np.float64,
    'AvSigVersion_3': np.int64,
    'AvSigVersion_4': np.int64,
    'OsVer_1': np.int64,
    'OsVer_2': np.int64,
    'OsVer_3': np.int64,
    'OsVer_4': np.int64,
    'OsBuildLab_1': np.float64,
    'OsBuildLab_2': np.float64,
    'OsBuildLab_3': np.int64,
    'OsBuildLab_4': np.int64,
    'OsBuildLab_5': np.float64,
    'OsBuildLab_6': np.float64,
    'Census_OSVersion_1': np.int64,
    'Census_OSVersion_2': np.int64,
    'Census_OSVersion_3': np.int64,
    'Census_OSVersion_4': np.int64
}

full_features = pd.read_csv("./csv/train_v6.csv", dtype=datatypes, index_col="MachineIdentifier")
#full_features = pd.read_csv("./csv/train.csv", dtype=datatypes, nrows=200000, index_col="MachineIdentifier")
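
The comments in this cell mention loading a sample and exporting its dtypes. A minimal sketch of that step is shown here, using an in-memory CSV as a stand-in for the real file. The names sample_csv and inferred are illustrative, and the approach assumes the sampled rows are representative: a column that looks like an integer in the sample but contains missing values further down would fail to load as int64.

```python
import io

import pandas as pd

# Stand-in for the real file; the notebook itself reads "./csv/train_v6.csv".
sample_csv = io.StringIO(
    "MachineIdentifier,IsBeta,Firewall,HasDetections\n"
    "a1,0,1.0,1\n"
    "b2,0,0.0,0\n"
    "c3,1,1.0,1\n"
)

# Let pandas infer the dtypes from a small number of rows ...
sample = pd.read_csv(sample_csv, nrows=2, index_col="MachineIdentifier")
inferred = {name: dtype.name for name, dtype in sample.dtypes.items()}
print(inferred)

# ... then reuse the mapping for the full load, skipping type inference.
sample_csv.seek(0)
full = pd.read_csv(sample_csv, dtype=inferred, index_col="MachineIdentifier")
```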

In [8]:
# Shuffle the data
#np.random.seed(0)

shuffle = np.random.permutation(np.arange(full_features.shape[0]))[:500000]
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]
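
Because the seed is commented out, each run of the cell above draws a different 500000-row subset, so scores are not exactly repeatable. As a hedged alternative, pandas can shuffle and subsample in one call with a fixed random_state. The frame below is a small stand-in for full_features.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; in the notebook this would be full_features.
frame = pd.DataFrame({"x": np.arange(10)}, index=[f"id{i}" for i in range(10)])

# sample() shuffles and subsamples in one step; random_state makes it repeatable.
first = frame.sample(n=5, random_state=0)
second = frame.sample(n=5, random_state=0)

print(first.index.equals(second.index))
```

sample() selects rows by position, so it does not depend on the index labels being unique.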

In [9]:
full_labels = full_features["HasDetections"]

# Dropping labels ["HasDetections"] from training dataset
full_features = full_features.drop(["HasDetections"], axis=1)
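
The overview notes that the data were sampled to include a much larger proportion of infected machines, so it can help to check the label balance before reading anything into an accuracy figure. The sketch below uses toy labels in place of full_labels; the variable names are illustrative.

```python
import pandas as pd

# Toy labels standing in for full_labels (the HasDetections column).
labels = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="HasDetections")

# Fraction of each class; the larger fraction is the accuracy obtained by
# always predicting the majority class.
balance = labels.value_counts(normalize=True).sort_index()
majority_baseline = balance.max()

print(balance.to_dict(), majority_baseline)
```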

In [10]:
print (full_features.shape)

(500000, 96)


In [13]:
full_features.head(10)

MachineIdentifier,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
5503064bc84f28cae97084ecc0c2cd08,1,0,7.0,0,49480.0,2.0,1.0,1,141,150142,27,167.0,227,10,1,15063,768,502,52,1.0,0,0,107.0,4,1.0,1.0,2,1,585.0,189218.0,4.0,5.0,2382.0,953869.0,1,952728.0,0,4096.0,1,15.5,1366.0,768.0,1,0.0,1,4,15063,608,3,3,5,9.0,34,2,0,1,2,0.0,1,0.0,556.0,63103.0,1,0.0,0.0,0,0,0.0,0,10,1,1,13504,0,4,11,15063,447,1,237.0,0,0,10,0,0,0,15063.0,0.0,1,4,170317.0,1834.0,10,0,15063,608
89849ff9e34b814084dff690e4785d71,1,0,7.0,0,53447.0,1.0,1.0,1,104,0,0,53.0,42,10,1,16299,768,503,52,1.0,0,0,111.0,4,1.0,1.0,2,1,525.0,331216.0,4.0,5.0,2412.0,114473.0,2,53857.0,0,4096.0,1,15.5,1366.0,768.0,1,1.0,1,2,16299,15,1,1,2,37.0,158,2,0,1,1,0.0,1,0.0,142.0,69939.0,1,0.0,0.0,0,0,0.0,0,7,1,1,15300,5,4,12,16299,15,1,275.0,1198,0,10,0,0,0,16299.0,15.0,1,3,170928.0,1534.0,10,0,16299,15
f6d865386bc8ffa40aef89c990d69e18,1,0,7.0,0,46413.0,2.0,1.0,1,211,24475,27,29.0,215,10,1,17134,768,504,52,1.0,0,0,137.0,6,1.0,1.0,2,1,2102.0,242491.0,4.0,1.0,212.0,476940.0,1,456871.0,0,4096.0,1,72.3,1360.0,768.0,1,0.0,1,1,17134,165,1,1,1,8.0,31,1,0,1,2,0.0,1,0.0,554.0,33084.0,1,0.0,0.0,0,0,0.0,0,10,1,1,15100,1,4,18,1807,18075,1,273.0,950,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,165
51ba01c7472e0761be93c4360d39656d,1,0,7.0,0,53447.0,1.0,1.0,1,6,147284,0,277.0,75,10,1,14393,768,501,52,1.0,0,0,103.0,6,1.0,1.0,2,1,3799.0,207053.0,2.0,5.0,4337.0,305245.0,1,104646.0,0,4096.0,1,15.5,1366.0,768.0,1,0.0,1,5,14393,1358,1,1,3,8.0,31,3,0,1,1,0.0,1,0.0,803.0,63599.0,0,0.0,0.0,0,0,0.0,0,3,1,1,14901,4,4,16,17656,18052,1,269.0,1369,0,10,0,0,0,14393.0,1358.0,1,5,170602.0,2252.0,10,0,14393,1358
153186bebfad6b20b0da390018d1c5e7,1,0,7.0,0,53200.0,3.0,2.0,1,43,12571,18,53.0,42,10,1,10586,768,202,52,1.0,0,0,74.0,6,1.0,1.0,2,1,525.0,331298.0,4.0,5.0,2697.0,953869.0,1,953093.0,0,4096.0,1,15.5,1366.0,768.0,1,23.0,1,7,10586,1176,4,4,6,37.0,158,2,0,1,2,0.0,1,0.0,142.0,70437.0,1,0.0,0.0,0,0,0.0,1,7,1,1,15200,1,4,9,10586,1106,1,275.0,1080,0,10,0,0,0,10586.0,1176.0,1,6,170913.0,1848.0,10,0,10586,1176
c4ed0b339cec2eb5a4f2b3b370f5023d,1,0,7.0,0,53447.0,1.0,1.0,1,43,63245,27,53.0,42,10,1,17134,256,504,55,1.0,0,0,137.0,4,1.0,1.0,1,1,4909.0,317701.0,12.0,1.0,1266.0,114473.0,2,59923.0,0,16384.0,2,27.0,1920.0,1080.0,2,4294967000.0,1,1,17134,286,2,2,2,37.0,158,1,0,1,3,0.0,1,0.0,142.0,53543.0,0,0.0,0.0,0,0,0.0,0,0,1,1,14600,4,4,13,17134,1,1,263.0,48,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,286
2db3be125d84dd08a08e15a5c6993e16,1,0,7.0,0,7945.0,2.0,1.0,1,44,16668,27,198.0,229,10,1,17134,768,504,52,1.0,0,0,137.0,6,1.0,1.0,2,1,4730.0,311197.0,4.0,5.0,2283.0,476940.0,1,465538.0,0,4096.0,1,13.9,1024.0,768.0,1,8.0,1,1,17134,285,3,3,3,9.0,34,1,0,1,1,0.0,1,0.0,556.0,13299.0,1,0.0,0.0,0,0,0.0,0,10,1,1,15200,1,4,18,1807,18075,1,275.0,1373,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,285
58af433c5645ad564be077e6ee2836e3,1,0,7.0,0,62773.0,1.0,1.0,1,176,71555,0,277.0,75,10,1,10586,256,202,55,0.0,0,0,78.0,6,0.0,1.0,2,1,2206.0,246508.0,4.0,5.0,2459.0,476940.0,1,99650.0,0,4096.0,1,13.2,1366.0,768.0,1,0.0,1,6,10586,446,2,2,2,8.0,31,2,0,2,3,0.0,1,0.0,500.0,33080.0,0,0.0,0.0,0,0,0.0,1,11,1,1,12805,0,4,9,10586,0,1,223.0,2112,0,10,0,0,0,10586.0,420.0,1,6,160527.0,1834.0,10,0,10586,446
88ac3c9ad4df37bac240359b9495f1d9,1,0,7.0,0,57629.0,2.0,1.0,1,99,95183,27,277.0,75,10,1,17134,768,504,52,1.0,0,0,137.0,6,1.0,1.0,2,1,2102.0,228975.0,4.0,5.0,3394.0,476940.0,1,464306.0,0,4096.0,1,15.5,1366.0,768.0,1,0.0,1,1,17134,254,1,1,4,8.0,31,1,0,1,2,0.0,1,0.0,554.0,33135.0,1,0.0,0.0,0,0,0.0,1,10,1,1,15200,1,4,13,17134,228,1,275.0,112,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,254
90b26b8e3780a5a2fe313f8a32db2d7d,1,0,7.0,0,53447.0,1.0,1.0,1,88,47844,18,117.0,74,10,1,17134,256,504,55,1.0,0,0,137.0,4,1.0,1.0,2,1,525.0,225820.0,8.0,5.0,3038.0,457862.0,2,457296.0,0,16384.0,1,15.5,1920.0,1080.0,1,93.0,1,1,17134,191,2,2,1,7.0,30,1,0,1,4,0.0,1,0.0,142.0,38267.0,0,0.0,0.0,0,0,0.0,1,3,1,1,15200,1,4,18,1807,18075,1,275.0,398,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,191


In [14]:
# Let's see some details of the loaded data
full_features.describe()

,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
count,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0
mean,1.010768,8e-06,6.848428,0.016994,47878.182768,1.325496,1.020798,0.987796,107.966926,78448.572958,17.207562,169.624696,122.840932,13.190182,1.182072,15715.117326,575.393594,480.020272,52.617264,0.945622,3.8e-05,0.000416,126.630154,4.85206,0.9784,13.70904,2.20314,1.001636,2226.51526,240047.890412,3.991062,4.532992,2371.678976,13560700.0,1.419256,375287.5,0.07704,6095.82945,1.544844,16.657827,1547.304226,897.383404,1.36996,1083380000.0,1.181836,2.649696,15830.135682,979.17327,1.976808,1.950646,2.946778,14.576686,60.505628,1.88599,0.00053,1.144648,1.59792,1e-05,1.015174,9e-05,397.69947,32977.69607,0.48645,0.0,0.00701,0.12623,0.038356,0.057626,0.273766,7.62061,1.0,1.0,15074.136806,1.298048,4.0,15.857228,5659.66728,14087.257628,0.999994,272.346234,934.18935,0.0,9.870104,0.076146,0.010916,0.000396,15715.112006,1431.051764,1.182072,3.089268,176500.768864,1777.4406,10.0,0.0,15830.135682,979.17327
std,0.103711,0.002828,1.017994,0.129249,13985.378116,0.521997,0.166918,0.109796,63.06447,50404.995008,12.398733,89.342689,69.296696,80.877373,0.575297,2193.749333,248.029848,80.323318,5.910511,0.226762,0.006164,0.020392,42.654334,1.269545,0.145374,8990.242,1.322468,0.040414,1308.499423,71974.767708,2.117134,1.284956,839.895051,9226205000.0,0.620752,325771.0,0.266655,5016.139709,1.47995,5.891691,368.550552,215.414988,0.626443,1865308000.0,0.574961,2.029657,1964.25359,2944.356339,1.145848,1.086996,1.818254,10.191358,45.044029,0.936383,0.023016,0.428142,0.758367,0.003162,0.303058,0.009486,222.493863,21012.475443,0.499817,0.0,0.083432,0.332109,0.192054,0.233035,0.445891,4.691989,0.0,0.0,278.055545,1.023975,0.0,3.328196,6262.448633,7381.928629,0.002449,6.066133,533.469939,0.0,0.709022,0.448736,0.125574,0.154842,2193.749117,4625.395886,0.575297,3.149581,6039.583834,210.870725,0.0,0.0,1964.25359,2944.356339
min,1.0,0.0,0.0,0.0,6.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,7.0,1.0,7600.0,16.0,201.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,1.0,1.0,23.0,22.0,1.0,1.0,17.0,7326.0,1.0,0.0,0.0,512.0,-1.0,3.0,-1.0,-1.0,0.0,0.0,1.0,1.0,10240.0,0.0,1.0,1.0,1.0,1.0,5.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,9.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,10302.0,0.0,4.0,4.0,204.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,7600.0,0.0,1.0,1.0,90713.0,100.0,10.0,0.0,10240.0,0.0
25%,1.0,0.0,7.0,0.0,49480.0,1.0,1.0,1.0,51.0,31434.0,0.0,89.0,75.0,10.0,1.0,15063.0,256.0,503.0,52.0,1.0,0.0,0.0,111.0,4.0,1.0,1.0,2.0,1.0,1443.0,189828.0,2.0,5.0,1998.0,239372.0,1.0,120324.0,0.0,4096.0,1.0,13.9,1366.0,768.0,1.0,0.0,1.0,1.0,15063.0,167.0,1.0,1.0,1.0,8.0,31.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,142.0,13187.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,15100.0,1.0,4.0,13.0,1807.0,17443.0,1.0,273.0,495.0,0.0,10.0,0.0,0.0,0.0,15063.0,1.0,1.0,1.0,170928.0,1804.0,10.0,0.0,15063.0,167.0
50%,1.0,0.0,7.0,0.0,53447.0,1.0,1.0,1.0,97.0,77866.0,18.0,181.0,88.0,10.0,1.0,16299.0,768.0,503.0,52.0,1.0,0.0,0.0,117.0,4.0,1.0,1.0,2.0,1.0,2102.0,248045.0,4.0,5.0,2503.0,476940.0,1.0,248643.0,0.0,4096.0,1.0,15.5,1366.0,768.0,1.0,0.0,1.0,2.0,16299.0,285.0,2.0,2.0,3.0,9.0,34.0,2.0,0.0,1.0,1.0,0.0,1.0,0.0,486.0,33075.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,15100.0,1.0,4.0,18.0,1807.0,18075.0,1.0,273.0,943.0,0.0,10.0,0.0,0.0,0.0,16299.0,1.0,1.0,2.0,180410.0,1804.0,10.0,0.0,16299.0,285.0
75%,1.0,0.0,7.0,0.0,53447.0,2.0,1.0,1.0,162.0,121753.0,27.0,267.0,182.0,10.0,1.0,17134.0,768.0,504.0,55.0,1.0,0.0,0.0,137.0,6.0,1.0,1.0,2.0,1.0,2668.0,309007.0,4.0,5.0,2867.0,953869.0,2.0,475965.0,0.0,8192.0,2.0,17.2,1920.0,1080.0,2.0,4294967000.0,1.0,4.0,17134.0,547.0,3.0,3.0,4.0,20.0,105.0,3.0,0.0,1.0,2.0,0.0,1.0,0.0,556.0,52200.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,1.0,1.0,15200.0,1.0,4.0,18.0,10586.0,18075.0,1.0,275.0,1379.0,0.0,10.0,0.0,0.0,0.0,17134.0,431.0,1.0,4.0,180410.0,1834.0,10.0,0.0,17134.0,547.0
max,6.0,1.0,35.0,1.0,70486.0,5.0,5.0,1.0,222.0,167962.0,52.0,296.0,283.0,2016.0,3.0,18242.0,784.0,508.0,90.0,1.0,1.0,1.0,429.0,6.0,1.0,6357062.0,11.0,2.0,6143.0,345490.0,192.0,10.0,4472.0,6523912000000.0,3.0,11445950.0,1.0,524288.0,16.0,142.0,7680.0,3840.0,7.0,4294967000.0,3.0,14.0,18242.0,41736.0,12.0,19.0,9.0,39.0,162.0,6.0,1.0,5.0,6.0,1.0,5.0,1.0,1087.0,72091.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,15.0,1.0,1.0,15300.0,6.0,4.0,18.0,17686.0,20082.0,1.0,277.0,6234.0,0.0,10.0,3.0,32.0,72.0,18242.0,24236.0,3.0,39.0,180914.0,2340.0,10.0,0.0,18242.0,41736.0


In [15]:
train_count = 400000 #int(len(full_features) * 0.8)

train_features = full_features.values[:train_count]
test_features  = full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]

scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
all_columns_score = clf.score(normalized_test_features, test_labels)
    
print("All columns (normalized)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)


All columns (normalized) (400000, 96) (100000, 96) (400000,) (100000,) HistGradientBoostingClassifier 63.893


In [16]:
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(train_features, train_labels)
all_columns_score = clf.score(test_features, test_labels)
    
print ("All columns (original)", train_features.shape, "HistGradientBoostingClassifier", all_columns_score*100)


All columns (original) (400000, 96) HistGradientBoostingClassifier 63.893


In [17]:
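# Note: PCA is fitted on the raw (unscaled) features here, so columns with very large
# variance dominate the leading components.
model = PCA(n_components=80)
pca_train_results = model.fit_transform(train_features)
pca_test_results = model.transform(test_features)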

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(pca_train_results, train_labels)
pca_all_columns_score = clf.score(pca_test_results, test_labels)
    
print ("All columns (PCA)", train_features.shape, "HistGradientBoostingClassifier", pca_all_columns_score*100)


All columns (PCA) (400000, 96) HistGradientBoostingClassifier 62.852
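
The describe() output earlier shows columns on very different scales (for instance, Census_PrimaryDiskTotalCapacity has a standard deviation of about 9.2e9, while IsBeta has about 0.0028). PCA maximizes variance, so on unscaled data the leading components mostly track the largest-scale columns. The sketch below shows the effect on synthetic data and one common remedy, standardizing inside a pipeline before PCA. Whether this would improve the score on this dataset has not been tested here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent features; the first has a far larger scale than the others.
X = rng.normal(size=(1000, 3)) * np.array([1e6, 1.0, 1.0])

raw_pca = PCA(n_components=2).fit(X)
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

# Share of total variance captured by the first component in each case.
raw_ratio = raw_pca.explained_variance_ratio_[0]
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_[0]
print("first component share, raw:", raw_ratio)
print("first component share, standardized:", scaled_ratio)
```

On the raw data the first component captures almost all of the variance because it is essentially the large-scale column; after standardization the three independent features contribute roughly equally.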


In [29]:
def optimize_score(all_features, labels, current_score, train_count, test_count, level):
    """Drop one column at a time and recurse whenever the score does not get worse.

    Note: after a recursive call returns, the loop carries on with the remaining
    columns at this level, so the total number of model fits can grow very quickly.
    """
    for c in all_features.columns:
        df_features = all_features.drop(c, axis=1)

        train_features = df_features.values[:train_count]
        test_features  = df_features.values[train_count:train_count+test_count]

        train_labels = labels.values[:train_count]
        test_labels = labels.values[train_count:train_count+test_count]

        clf = ske.HistGradientBoostingClassifier(random_state=123)
        clf.fit(train_features, train_labels)
        score = clf.score(test_features, test_labels)

        #print (df_features.columns)
        print('Level', level, ': Dropping', c,
              train_features.shape, test_features.shape, "HistGradientBoosting",
              current_score*100, score*100, score >= current_score)

        # Keep the column dropped (and search deeper) if the score is at least as good
        if score >= current_score:
            optimize_score(df_features, labels, score, train_count, test_count, level + 1)

    print('Score for level', level, 'is', current_score, 'columns', all_features.columns)

# Let's try good old brute force ;)
# Caveat: all_columns_score was measured on the 400000/100000 split above, while this
# search uses a 300000/70000 split, so the level-1 baseline is only approximate.
optimize_score(full_features, full_labels, all_columns_score, 300000,70000,1)


Level 1 : Dropping ProductName (300000, 95) (70000, 95) HistGradientBoosting 63.893 64.24571428571429 True
Level 2 : Dropping IsBeta (300000, 94) (70000, 94) HistGradientBoosting 64.24571428571429 64.24571428571429 True
Level 3 : Dropping RtpStateBitfield (300000, 93) (70000, 93) HistGradientBoosting 64.24571428571429 64.14142857142858 False
Level 3 : Dropping IsSxsPassiveMode (300000, 93) (70000, 93) HistGradientBoosting 64.24571428571429 64.08285714285714 False
Level 3 : Dropping AVProductStatesIdentifier (300000, 93) (70000, 93) HistGradientBoosting 64.24571428571429 63.70285714285714 False
Level 3 : Dropping AVProductsInstalled (300000, 93) (70000, 93) HistGradientBoosting 64.24571428571429 64.06428571428572 False
Level 3 : Dropping AVProductsEnabled (300000, 93) (70000, 93) HistGradientBoosting 64.24571428571429 64.05428571428573 False
Level 3 : Dropping HasTpm (300000, 93) (70000, 93) HistGradientBoosting 64.24571428571429 64.14142857142858 False
Level 3 : Dropping CountryIdentif

KeyboardInterrupt: 