The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.

The sampling methodology used to create this dataset was designed to meet certain business constraints, with regard both to user privacy and to the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is complicated by the introduction of new machines, machines that come online and go offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above mean you may see imperfect agreement between your cross-validation, public, and private scores. Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from sklearn.experimental import enable_hist_gradient_boosting
import sklearn.ensemble as ske
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [2]:
# set up display area to show dataframe in jupyter qtconsole
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

Columns

Unavailable or self-documenting column names are marked with an "NA".

    MachineIdentifier - Individual machine ID
    ProductName - Defender state information e.g. win8defender
    EngineVersion - Defender state information e.g. 1.1.12603.0
    AppVersion - Defender state information e.g. 4.9.10586.0
    AvSigVersion - Defender state information e.g. 1.217.1014.0
    IsBeta - Defender state information e.g. false
    RtpStateBitfield - NA
    IsSxsPassiveMode - NA
    DefaultBrowsersIdentifier - ID for the machine's default browser
    AVProductStatesIdentifier - ID for the specific configuration of a user's antivirus software
    AVProductsInstalled - NA
    AVProductsEnabled - NA
    HasTpm - True if machine has tpm
    CountryIdentifier - ID for the country the machine is located in
    CityIdentifier - ID for the city the machine is located in
    OrganizationIdentifier - ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries
    GeoNameIdentifier - ID for the geographic region a machine is located in
    LocaleEnglishNameIdentifier - English name of Locale ID of the current user
    Platform - Calculates platform name (of OS related properties and processor property)
    Processor - This is the processor architecture of the installed operating system
    OsVer - Version of the current operating system
    OsBuild - Build of the current operating system
    OsSuite - Product suite mask for the current operating system.
    OsPlatformSubRelease - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)
    OsBuildLab - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022
    SkuEdition - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Types are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must handle each class.
    IsProtected - This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected.
    AutoSampleOptIn - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+
    PuaMode - Pua Enabled mode from the service
    SMode - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
    IeVerIdentifier - NA
    SmartScreen - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.
    Firewall - This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service.
    UacLuaenable - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.
    Census_MDC2FormFactor - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)
    Census_DeviceFamily - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
    Census_OEMNameIdentifier - NA
    Census_OEMModelIdentifier - NA
    Census_ProcessorCoreCount - Number of logical cores in the processor
    Census_ProcessorManufacturerIdentifier - NA
    Census_ProcessorModelIdentifier - NA
    Census_ProcessorClass - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated
    Census_PrimaryDiskTotalCapacity - Amount of disk space on primary disk of the machine in MB
    Census_PrimaryDiskTypeName - Friendly name of Primary Disk Type - HDD or SSD
    Census_SystemVolumeTotalCapacity - The size of the partition that the System volume is installed on in MB
    Census_HasOpticalDiskDrive - True indicates that the machine has an optical disk drive (CD/DVD)
    Census_TotalPhysicalRAM - Retrieves the physical RAM in MB
    Census_ChassisTypeName - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
    Census_InternalPrimaryDiagonalDisplaySizeInInches - Retrieves the physical diagonal length in inches of the primary display
    Census_InternalPrimaryDisplayResolutionHorizontal - Retrieves the number of pixels in the horizontal direction of the internal display.
    Census_InternalPrimaryDisplayResolutionVertical - Retrieves the number of pixels in the vertical direction of the internal display
    Census_PowerPlatformRoleName - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
    Census_InternalBatteryType - NA
    Census_InternalBatteryNumberOfCharges - NA
    Census_OSVersion - Numeric OS version Example - 10.0.10130.0
    Census_OSArchitecture - Architecture on which the OS is based. Derived from OSVersionFull. Example - amd64
    Census_OSBranch - Branch of the OS extracted from the OsVersionFull. Example - OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005
    Census_OSBuildNumber - OS Build number extracted from the OsVersionFull. Example - OsBuildNumber = 10512 or 10240
    Census_OSBuildRevision - OS Build revision extracted from the OsVersionFull. Example - OsBuildRevision = 1000 or 16458
    Census_OSEdition - Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise
    Census_OSSkuName - OS edition friendly name (currently Windows only)
    Census_OSInstallTypeName - Friendly description of what install was used on the machine i.e. clean
    Census_OSInstallLanguageIdentifier - NA
    Census_OSUILocaleIdentifier - NA
    Census_OSWUAutoUpdateOptionsName - Friendly name of the WindowsUpdate auto-update settings on the machine.
    Census_IsPortableOperatingSystem - Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.
    Census_GenuineStateName - Friendly name of OSGenuineStateID. 0 = Genuine
    Census_ActivationChannel - Retail license key or Volume license key for a machine.
    Census_IsFlightingInternal - NA
    Census_IsFlightsDisabled - Indicates if the machine is participating in flighting.
    Census_FlightRing - The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.
    Census_ThresholdOptIn - NA
    Census_FirmwareManufacturerIdentifier - NA
    Census_FirmwareVersionIdentifier - NA
    Census_IsSecureBootEnabled - Indicates if Secure Boot mode is enabled.
    Census_IsWIMBootEnabled - NA
    Census_IsVirtualDevice - Identifies a Virtual Machine (machine learning model)
    Census_IsTouchEnabled - Is this a touch device?
    Census_IsPenCapable - Is the device capable of pen input?
    Census_IsAlwaysOnAlwaysConnectedCapable - Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.
    Wdft_IsGamer - Indicates whether the device is a gamer device or not based on its hardware combination.
    Wdft_RegionIdentifier - NA


In [4]:
# We need to explicitly specify data types when reading the csv, otherwise loading uses a lot of memory
# and we get the warning "Specify dtype option on import or set low_memory=False".
# So, we will define the data types manually

# P.S. I have loaded the sample data and exported train_data.dtypes
# these are the data types for fast loading

datatypes = {
    'ProductName': str,
    'EngineVersion': str,
    'AppVersion': str,
    'AvSigVersion': str,
    'IsBeta': np.int8,
    'RtpStateBitfield': str,
    'IsSxsPassiveMode': np.int8,
    'DefaultBrowsersIdentifier': str,
    'AVProductStatesIdentifier': str,
    'AVProductsInstalled': str,
    'AVProductsEnabled': str,
    'HasTpm': np.int8,
    'CountryIdentifier': str,
    'CityIdentifier': str,
    'OrganizationIdentifier': str,
    'GeoNameIdentifier': str,
    'LocaleEnglishNameIdentifier': str,
    'Platform': str,
    'Processor': str,
    'OsVer': str,
    'OsBuild': str,
    'OsSuite': str,
    'OsPlatformSubRelease': str,
    'OsBuildLab': str,
    'SkuEdition': str,
    'IsProtected': str,
    'AutoSampleOptIn': np.int8,
    'PuaMode': str,
    'SMode': str,
    'IeVerIdentifier': str,
    'SmartScreen': str,
    'Firewall': str,
    'UacLuaenable': str,
    'Census_MDC2FormFactor': str,
    'Census_DeviceFamily': str,
    'Census_OEMNameIdentifier': str,
    'Census_OEMModelIdentifier': str, 
    'Census_ProcessorCoreCount': str,
    'Census_ProcessorManufacturerIdentifier': str,
    'Census_ProcessorModelIdentifier': str,
    'Census_ProcessorClass': str,
    'Census_PrimaryDiskTotalCapacity': np.float64,
    'Census_PrimaryDiskTypeName': str,
    'Census_SystemVolumeTotalCapacity': np.float64,
    'Census_HasOpticalDiskDrive': np.int8,
    'Census_TotalPhysicalRAM': np.float64,
    'Census_ChassisTypeName': str,
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': str,
    'Census_InternalPrimaryDisplayResolutionHorizontal': str,
    'Census_InternalPrimaryDisplayResolutionVertical': str,
    'Census_PowerPlatformRoleName': str,
    'Census_InternalBatteryType': str,
    'Census_InternalBatteryNumberOfCharges': str,
    'Census_OSVersion': str,
    'Census_OSArchitecture': str,
    'Census_OSBranch': str,
    'Census_OSBuildNumber': str,
    'Census_OSBuildRevision': str,
    'Census_OSEdition': str,
    'Census_OSSkuName': str,
    'Census_OSInstallTypeName': str,
    'Census_OSInstallLanguageIdentifier': str,
    'Census_OSUILocaleIdentifier': str,
    'Census_OSWUAutoUpdateOptionsName': str,
    'Census_IsPortableOperatingSystem': np.int8,
    'Census_GenuineStateName': str,
    'Census_ActivationChannel': str,
    'Census_IsFlightingInternal': str,
    'Census_IsFlightsDisabled': str,
    'Census_FlightRing': str,
    'Census_ThresholdOptIn': str,
    'Census_FirmwareManufacturerIdentifier': str,
    'Census_FirmwareVersionIdentifier': str,
    'Census_IsSecureBootEnabled': np.int8,
    'Census_IsWIMBootEnabled': str,
    'Census_IsVirtualDevice': str,
    'Census_IsTouchEnabled': np.int8,
    'Census_IsPenCapable': np.int8,
    'Census_IsAlwaysOnAlwaysConnectedCapable': str,
    'Wdft_IsGamer': str,
    'Wdft_RegionIdentifier': str,
    'HasDetections': np.int8
}

full_features = pd.read_csv("./csv/train.csv", dtype=datatypes, index_col="MachineIdentifier")
#full_features = pd.read_csv("./csv/train.csv", dtype=datatypes, nrows=200000, index_col="MachineIdentifier")
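
The effect of passing explicit dtypes can be checked on a small scale. The following is a minimal sketch, not part of the original notebook: the three-row CSV text is made up for illustration, and only the column names are taken from the dataset.

```python
import io

import numpy as np
import pandas as pd

# Tiny, made-up stand-in for train.csv (illustration only).
csv_text = (
    "MachineIdentifier,ProductName,IsBeta,HasDetections\n"
    "a1,win8defender,0,1\n"
    "b2,mse,0,0\n"
    "c3,win8defender,1,1\n"
)

# Default inference: small integer flags are parsed as int64.
inferred = pd.read_csv(io.StringIO(csv_text), index_col="MachineIdentifier")

# Explicit dtypes, in the same spirit as the `datatypes` dict above.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ProductName": str, "IsBeta": np.int8, "HasDetections": np.int8},
    index_col="MachineIdentifier",
)

print(inferred.dtypes)
print(typed.dtypes)

# Bytes used by one flag column under each scheme (8 bytes vs 1 byte per row).
print(inferred["IsBeta"].memory_usage(index=False))
print(typed["IsBeta"].memory_usage(index=False))
```

On a frame with millions of rows, the same per-row difference applies to every flag column stored as int8 instead of int64.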

In [407]:
# Shuffle the data
#np.random.seed(0)

shuffle = np.random.permutation(np.arange(full_features.shape[0]))[:500000]
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]
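
Because the seed line above is commented out, each run draws a different subset. If a repeatable subset is wanted, one possible alternative is `DataFrame.sample` with a fixed `random_state`. This is a sketch on a toy frame; `toy` and the sample size of 4 are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for full_features (illustration only).
toy = pd.DataFrame({"x": np.arange(10)}, index=[f"m{i}" for i in range(10)])

# Draw 4 rows without replacement; random_state makes the draw repeatable.
sample_a = toy.sample(n=4, random_state=0)
sample_b = toy.sample(n=4, random_state=0)

# Both draws select the same rows in the same order.
print(sample_a.index.equals(sample_b.index))
```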

In [None]:
full_labels = full_features["HasDetections"]

# Dropping labels ["HasDetections"] from training dataset
full_features = full_features.drop(["HasDetections"], axis=1)

In [5]:
print (full_features.shape)

(8921483, 82)


In [6]:
# Checking the columns with the most NULL values
print((full_features.isnull().sum()).sort_values(ascending=False).head(20))

PuaMode                                  8919174
Census_ProcessorClass                    8884852
DefaultBrowsersIdentifier                8488045
Census_IsFlightingInternal               7408759
Census_InternalBatteryType               6338429
Census_ThresholdOptIn                    5667325
Census_IsWIMBootEnabled                  5659703
SmartScreen                              3177011
OrganizationIdentifier                   2751518
SMode                                     537759
CityIdentifier                            325409
Wdft_IsGamer                              303451
Wdft_RegionIdentifier                     303451
Census_InternalBatteryNumberOfCharges     268755
Census_FirmwareManufacturerIdentifier     183257
Census_IsFlightsDisabled                  160523
Census_FirmwareVersionIdentifier          160133
Census_OEMModelIdentifier                 102233
Census_OEMNameIdentifier                   95478
Firewall                                   91350
dtype: int64
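
Absolute counts are easier to judge as a share of all rows: PuaMode, for example, is missing in 8,919,174 of 8,921,483 rows, which is about 99.97%. A minimal sketch of computing that share per column, using a made-up four-row frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column names come from the dataset.
toy = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SMode": ["0", np.nan, "1", "0"],
    "Firewall": ["1", "1", "1", "1"],
})

# isnull().mean() gives the fraction of missing values in each column.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```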


In [7]:
full_features['PuaMode'].unique()

array([nan, 'on', 'audit'], dtype=object)

In [8]:
full_features['Census_IsFlightingInternal'].unique()

array([nan, '0', '1'], dtype=object)

In [9]:
full_features['Census_InternalBatteryType'].unique()

array([nan, 'lion', 'li-i', '#', 'lip', 'liio', 'vbox', 'li p', 'real',
       'unkn', 'pbac', 'li', 'bq20', 'nimh', '\x04lio', 'lgi0', 'lhp0',
       'ithi', 'batt', 'lipp', 'lipo', '4cel', 'ram', 'lit', 'a140',
       'bad', 'asmb', 'virt', 'ca48', '4ion', 'd', 'a132', 'ÿÿÿÿ', 'cl53',
       'lio', 'li-l', '÷ÿóö', 'í\x03-i', '0x0b', 'lgs0', '3ion', 'ots0',
       'lai0', 'lilo', 'pa50', 'h4°s', '5nm1', 'li-p', 'lhpo', '0ts0',
       'pad0', 'sail', 'p-sn', 'icp3', 'a130', '2337', '\x1f˙˙˙', 'lgl0',
       'l\x15', '@i\uf8f5\uf8f5', 'li\x90o', '4lio', 'lp', 'li?',
       '\x04ion', 'pbso', 'a138', 'li-h', '6ion', '3500', 'h00j',
       'li\x10', 'sams', '\x03ip', '8', '#TAB#', 'l\x06&#TAB#', 'liÿÿ',
       'lÿÿÿ'], dtype=object)

In [10]:
full_features['Census_ThresholdOptIn'].unique()

array([nan, '0', '1'], dtype=object)

In [11]:
full_features['Census_IsWIMBootEnabled'].unique()

array([nan, '0', '1'], dtype=object)

In [12]:
full_features['SMode'].unique()

array(['0', nan, '1'], dtype=object)

In [13]:
full_features['OrganizationIdentifier'].unique()

array(['18', nan, '27', '46', '11', '14', '37', '10', '50', '49', '33',
       '8', '48', '36', '31', '4', '1', '28', '3', '52', '32', '51', '5',
       '2', '47', '44', '16', '40', '20', '22', '29', '26', '21', '39',
       '6', '19', '7', '30', '42', '43', '41', '15', '45', '25', '35',
       '23', '38', '12', '17', '34'], dtype=object)

In [14]:
full_features['Wdft_IsGamer'].unique()

array(['0', '1', nan], dtype=object)

In [15]:
full_features['Wdft_RegionIdentifier'].unique()

array(['10', '8', '3', '1', '15', '7', '11', '2', '12', '4', '13', nan,
       '6', '9', '5', '14'], dtype=object)

In [16]:
full_features['CityIdentifier'].unique()

array(['128035', '1482', '153579', ..., '47472', '147921', '97837'],
      dtype=object)

In [17]:
full_features['Census_InternalBatteryNumberOfCharges'].unique()

array(['4294967295', '1', '0', ..., '27736', '26424', '16807'],
      dtype=object)

In [18]:
# Cleaning up some data

# PuaMode - Potentially Unwanted Applications, if NA, then it is disabled. 99% are NA. So, better to drop it
# Census_ProcessorClass - According to the description - "No longer maintained and updated"
# DefaultBrowsersIdentifier - Almost all values are empty. Therefore we will drop this column
# Census_IsFlightingInternal - whether this is internal or "external" testing ring. Column mostly unused. Will have to drop it
# Census_InternalBatteryType - contains mostly garbage. Besides, it should not be relevant to the attack surface.
# Census_ThresholdOptIn - also mostly unused. Googled it: Threshold was used in the first versions of Windows 10. Looks unused now
# Census_IsWIMBootEnabled - Is it possible to boot from a Windows Image? Not relevant to identifying attacks when most of the data is empty
# SmartScreen - Whether smart screen in explorer is enabled. Should be important. "ExistsNotSet" when null, according to the description
# SMode - Quite relevant field. Will be keeping it
# OrganizationIdentifier - Attacks by organizations should be analyzed. If not filled, will assign "0". 
# Census_InternalBatteryNumberOfCharges - Not relevant. Will drop this column in order not to overtrain
# Census_OSSkuName -  OS edition friendly name (currently Windows only). - Can be removed. Duplicate field
# Census_ChassisTypeName - Census_MDC2FormFactor gives better information. Let's remove this field

full_features['PuaMode'] = full_features['PuaMode'].fillna('off')
full_features['SmartScreen'] = full_features['SmartScreen'].fillna('ExistsNotSet')
full_features['SMode'] = full_features['SMode'].fillna('0').astype('int8')
full_features['OrganizationIdentifier'] = full_features['OrganizationIdentifier'].fillna('0').astype('int32')
full_features['Wdft_IsGamer'] = full_features['Wdft_IsGamer'].fillna('0').astype('int8')
full_features['Wdft_RegionIdentifier'] = full_features['Wdft_RegionIdentifier'].fillna('0').astype('int32')
full_features['CityIdentifier'] = full_features['CityIdentifier'].fillna('0').astype('int32')

full_features = full_features.drop([
    'PuaMode',
    'Census_ProcessorClass',
    'DefaultBrowsersIdentifier',
    'Census_IsFlightingInternal',
    'Census_InternalBatteryType'], axis=1)
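
The cell above picks the columns to drop by hand. A rule-based variant is sketched below under stated assumptions: the 0.5 cutoff (`max_null_share`) is a hypothetical choice, not the author's, and the frame is a made-up example.

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column names come from the dataset.
toy = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SMode": ["0", np.nan, "1", "0"],
    "OrganizationIdentifier": ["18", np.nan, "27", "18"],
})

# Hypothetical cutoff: drop any column that is more than half empty.
max_null_share = 0.5
too_sparse = toy.columns[toy.isnull().mean() > max_null_share]
cleaned = toy.drop(columns=too_sparse)

# Fill the remaining gaps with "0" and store the codes as small integers,
# mirroring the fillna(...).astype(...) pattern used above.
cleaned["SMode"] = cleaned["SMode"].fillna("0").astype("int8")
cleaned["OrganizationIdentifier"] = (
    cleaned["OrganizationIdentifier"].fillna("0").astype("int32")
)
print(cleaned.dtypes)
```

A fixed cutoff is only a starting point; a column such as SmartScreen is kept above despite many missing values because it is expected to be informative.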

In [19]:
# Now let us check the string columns

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['ProductName',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'Platform',
 'Processor',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'IsProtected',
 'IeVerIdentifier',
 'SmartScreen',
 'Firewall',
 'UacLuaenable',
 'Census_MDC2FormFactor',
 'Census_DeviceFamily',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_PowerPlatformRoleName',
 'Census_InternalBatteryNumberOfCharges',
 'Census_OSVersion',
 'Census_OSArchitecture',
 'Cen
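
The same list of object-typed columns can also be obtained without an explicit loop. A small sketch with `DataFrame.select_dtypes`, using a made-up frame whose string columns are stored as `object`, as they are in the frame loaded above:

```python
import pandas as pd

# Made-up frame; string columns are explicitly stored with dtype object.
toy = pd.DataFrame({
    "ProductName": pd.Series(["win8defender", "mse"], dtype=object),
    "IsBeta": pd.Series([0, 0], dtype="int8"),
    "OsVer": pd.Series(["10.0.0.0", "6.1.1.0"], dtype=object),
})

# Keep only the columns whose dtype is object, in their original order.
string_cols = toy.select_dtypes(include="object").columns.tolist()
print(string_cols)
```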

In [20]:
full_features[string_columns].head(10)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTypeName,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable
0000028988387b115f69f31a3bf04f09,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1735.0,7,53447,1,1,29,35,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1,137.0,ExistsNotSet,1,1,Desktop,Windows.Desktop,2668,9124,4,5,2341,HDD,Desktop,18.9,1440,900,Desktop,4294967295,10.0.17134.165,amd64,rs4_release,17134,165,Professional,PROFESSIONAL,UUPUpgrade,26,119,UNKNOWN,IS_GENUINE,Retail,0,Retail,,628,36144,,0,0
000007535c3f730efa9ea0b7ef1bd645,win8defender,1.1.14600.4,4.13.17134.1,1.263.48.0,7,53447,1,1,93,119,64,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1,137.0,ExistsNotSet,1,1,Notebook,Windows.Desktop,2668,91656,4,5,2405,HDD,Notebook,13.9,1366,768,Mobile,1,10.0.17134.1,amd64,rs4_release,17134,1,Professional,PROFESSIONAL,IBSClean,8,31,UNKNOWN,OFFLINE,Retail,0,NOT_SET,,628,57858,,0,0
000007905a28d863f6d0d597892cd692,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1341.0,7,53447,1,1,86,64,49,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1,137.0,RequireAdmin,1,1,Desktop,Windows.Desktop,4909,317701,4,5,1972,SSD,Desktop,21.5,1920,1080,Desktop,4294967295,10.0.17134.165,amd64,rs4_release,17134,165,Core,CORE,UUPUpgrade,7,30,FullAuto,IS_GENUINE,OEM:NONSLP,0,Retail,,142,52682,,0,0
00000b11598a75ea8ba1beea8459149f,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1527.0,7,53447,1,1,88,117,115,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1,137.0,ExistsNotSet,1,1,Desktop,Windows.Desktop,1443,275890,4,5,2273,UNKNOWN,MiniTower,18.5,1366,768,Desktop,4294967295,10.0.17134.228,amd64,rs4_release,17134,228,Professional,PROFESSIONAL,UUPUpgrade,17,64,FullAuto,IS_GENUINE,OEM:NONSLP,0,Retail,,355,20050,,0,0
000014a5f00daa18e76b81417eeb99fc,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1379.0,7,53447,1,1,18,277,75,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1,137.0,RequireAdmin,1,1,Notebook,Windows.Desktop,1443,331929,4,5,2500,HDD,Portable,14.0,1366,768,Mobile,0,10.0.17134.191,amd64,rs4_release,17134,191,Core,CORE,Update,8,31,FullAuto,IS_GENUINE,Retail,0,Retail,0.0,355,19844,0.0,0,0
000016191b897145d069102325cab760,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1094.0,7,53447,1,1,97,126,124,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1,137.0,RequireAdmin,1,1,Desktop,Windows.Desktop,3799,340727,2,5,4324,SSD,Desktop,21.5,1920,1080,Desktop,4294967295,10.0.17134.165,amd64,rs4_release,17134,165,Professional,PROFESSIONAL,UUPUpgrade,18,72,FullAuto,IS_GENUINE,Retail,0,Retail,0.0,93,51039,0.0,0,0
0000161e8abf8d8b89c5ab8787fd712b,win8defender,1.1.15100.1,4.18.1807.18075,1.273.845.0,7,43927,2,1,78,89,88,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1,137.0,ExistsNotSet,1,1,Notebook,Windows.Desktop,3799,207404,2,1,657,HDD,Notebook,17.2,1600,900,Mobile,0,10.0.17134.165,amd64,rs4_release,17134,165,Core,CORE,IBSClean,14,49,FullAuto,IS_GENUINE,Retail,0,Retail,,556,63175,,0,0
000019515bc8f95851aff6de873405e8,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1393.0,7,53447,1,1,97,126,124,windows10,x64,10.0.0.0,14393,768,rs1,14393.0.amd64fre.rs1_release.160715-1616,Home,1,94.0,RequireAdmin,1,1,Notebook,Windows.Desktop,5682,338896,2,5,3381,HDD,Notebook,15.5,1366,768,Mobile,0,10.0.14393.0,amd64,rs1_release,14393,0,Core,CORE,Upgrade,18,72,FullAuto,IS_GENUINE,Retail,0,Retail,0.0,512,63122,0.0,0,0
00001a027a0ab970c408182df8484fce,win8defender,1.1.15200.1,4.18.1807.18075,1.275.988.0,7,53447,1,1,164,205,172,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1,137.0,RequireAdmin,1,1,Notebook,Windows.Desktop,2206,240688,4,5,2836,HDD,Notebook,15.6,1920,1080,Mobile,0,10.0.17134.254,amd64,rs4_release,17134,254,Professional,PROFESSIONAL,Update,27,120,FullAuto,IS_GENUINE,Retail,0,Retail,0.0,500,15510,0.0,0,0
00001a18d69bb60bda9779408dcf02ac,win8defender,1.1.15100.1,4.18.1807.18075,1.273.973.0,7,46413,2,1,93,119,64,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.1805...,Home,1,,RequireAdmin,1,1,Notebook,Windows.Desktop,585,189457,4,5,2373,HDD,Notebook,15.5,1366,768,Mobile,0,10.0.16299.431,amd64,rs3_release_svc_escrow,16299,431,CoreSingleLanguage,CORE_SINGLELANGUAGE,Upgrade,8,31,UNKNOWN,IS_GENUINE,OEM:DM,0,Retail,0.0,556,63555,0.0,0,0


A first glance at the data shows that the strings are either categorical values or version numbers made up of four such values. So, in order to use algorithms that support only numeric values, we will convert categorical fields like "ProductName" to integer codes, and version fields like AppVersion to their four numeric components.
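
As an illustration of the version case, a dotted version string can be split into four integer columns. This is a minimal sketch, not the notebook's own code: the two AppVersion values are taken from the table above, while the output column names `AppVersion_1` to `AppVersion_4` are hypothetical.

```python
import pandas as pd

# Two AppVersion values as they appear in the sample rows above.
versions = pd.Series(["4.18.1807.18075", "4.13.17134.1"], name="AppVersion")

# Split on the literal "." into four columns, then convert each part to an integer.
parts = versions.str.split(".", expand=True, regex=False).astype("int32")

# Hypothetical names for the four components.
parts.columns = [f"AppVersion_{i}" for i in range(1, 5)]
print(parts)
```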

In [21]:
def df_replacevalues(df, colname, oldvalues, newvalues):
    # First, we need to get the most frequent value of the column
    topvalue = df[colname].value_counts().idxmax()

    # Replace NaN values with the most frequent value.
    # Assigning the result back (instead of calling fillna(inplace=True)
    # on a column selection) makes sure the change reaches df itself.
    df[colname] = df[colname].fillna(topvalue)

    # We need to make sure no value other than oldvalues exists
    garbage = ~df[colname].isin(oldvalues)

    # If the "garbage" values are more than 1% of the rows, raise an error
    if garbage.sum() > len(df) / 100:
        raise ValueError("Not all necessary values are present in oldvalues array")

    # Replace "garbage" with the top value
    df.loc[garbage, colname] = topvalue

    print("Previous values", df[colname].unique())
    df[colname] = pd.to_numeric(df[colname].replace(oldvalues, newvalues), errors='raise', downcast='integer')
    print("New values", df[colname].unique())
    
#full_features["Platform"].unique()
#full_features["Platform"].value_counts()
#full_features[~full_features["ProductName"].isin(['win8defender', 'mse'])].index
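
The steps the function performs (impute with the most frequent value, overwrite unexpected values, map to integers) can be traced on a small example. The sketch below is a variant that uses a dictionary and `Series.map` instead of `replace`; the toy column and the `"??"` garbage value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value and one unexpected value.
toy = pd.DataFrame({"Processor": ["x64", "x86", np.nan, "x64", "??"]})
codes = {"x64": 1, "arm64": 2, "x86": 3}

col = toy["Processor"]
top = col.value_counts().idxmax()             # most frequent value, here "x64"
col = col.fillna(top)                         # impute missing values
col = col.where(col.isin(list(codes)), top)   # overwrite values not in `codes`
toy["Processor"] = col.map(codes).astype("int8")

print(toy["Processor"].tolist())
```

Note that this sketch omits the 1% safety check, so it silently overwrites any number of unexpected values.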

In [22]:
colname = "ProductName"
oldvalues = ['win8defender','mse','mseprerelease','windowsintune','fep','scep']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['win8defender' 'mse' 'mseprerelease' 'windowsintune' 'fep' 'scep']
New values [1 2 3 4 5 6]


In [23]:
colname = "Platform"
oldvalues = ['windows10','windows7','windows8','windows2016']
newvalues = [10,7,8,2016]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['windows10' 'windows7' 'windows8' 'windows2016']
New values [  10    7    8 2016]


In [24]:
colname = "Processor"
oldvalues = ['x64','arm64','x86']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['x64' 'arm64' 'x86']
New values [1 2 3]


In [25]:
colname = "OsPlatformSubRelease"
oldvalues = ['rs4','rs1','rs3','windows7','windows8.1','th1','rs2','th2','prers5']
newvalues = [504,501,503,507,508,201,502,202,405]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['rs4' 'rs1' 'rs3' 'windows7' 'windows8.1' 'th1' 'rs2' 'th2' 'prers5']
New values [504 501 503 507 508 201 502 202 405]


In [26]:
colname = "SkuEdition"
oldvalues = ['Pro','Home','Invalid','Enterprise LTSB','Enterprise','Education','Cloud','Server']
newvalues = [55,52,0,71,70,20,90,80]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Pro' 'Home' 'Invalid' 'Enterprise LTSB' 'Enterprise' 'Education' 'Cloud'
 'Server']
New values [55 52  0 71 70 20 90 80]


In [27]:
colname = "SmartScreen"
oldvalues = ['Off','off','OFF','On','on','Warn','Prompt','ExistsNotSet','Block','RequireAdmin']
newvalues = [0,0,0,1,1,2,3,4,5,6]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['ExistsNotSet' 'RequireAdmin' 'Off' 'Warn' 'Prompt' 'Block' 'off' 'On'
 'on' 'OFF']
New values [4 6 0 2 3 5 1]
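
The SmartScreen values differ only in letter case for some settings ('Off', 'off', 'OFF'), which the cell above handles by listing every spelling. One possible alternative, sketched here on a made-up series with a shortened code table, is to lower-case the values first:

```python
import pandas as pd

# Made-up sample of raw SmartScreen spellings.
smart = pd.Series(["Off", "off", "OFF", "On", "on", "Warn"])

# Lower-casing first means each setting needs only one dictionary entry.
codes = {"off": 0, "on": 1, "warn": 2}
encoded = smart.str.lower().map(codes)

print(encoded.tolist())
```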


In [28]:
colname = "Census_MDC2FormFactor"
oldvalues = ['Desktop','Notebook','Detachable','PCOther','AllInOne','Convertible','SmallTablet','LargeTablet','SmallServer','LargeServer','MediumServer','ServerOther','IoTOther']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Desktop' 'Notebook' 'Detachable' 'PCOther' 'AllInOne' 'Convertible'
 'SmallTablet' 'LargeTablet' 'SmallServer' 'LargeServer' 'MediumServer'
 'ServerOther' 'IoTOther']
New values [ 1  2  3  4  5  6  7  8  9 10 11 12 13]


In [29]:
# Census_DeviceFamily ['Windows.Desktop' 'Windows.Server' 'Windows']

colname = "Census_DeviceFamily"
oldvalues = ['Windows.Desktop','Windows.Server','Windows']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Windows.Desktop' 'Windows.Server' 'Windows']
New values [1 2 3]


In [30]:
# Census_PrimaryDiskTypeName ['HDD' 'SSD' 'UNKNOWN' 'Unspecified' nan]

colname = "Census_PrimaryDiskTypeName"
oldvalues = ['HDD','SSD','UNKNOWN','Unspecified']
newvalues = [1,2,3,3]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['HDD' 'SSD' 'UNKNOWN' 'Unspecified']
New values [1 2 3]


In [31]:
# Census_ChassisTypeName Index(['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 'RackMountChassis', 'SubNotebook', 'BusExpansionChassis', '30', 'StickPC', '0', 'MultisystemChassis', 'Blade', '35', 'PizzaBox', 'SealedCasePC', 'SubChassis', 'ExpansionChassis', '31', '32', '88', '127', '25', '44', '36', 'DockingStation', 'BladeEnclosure', 'CompactPCI', '81', '45', 'EmbeddedPC', '28', '82', '112', 'IoTGateway', '49', '76', '39'], dtype='object')

colname = "Census_ChassisTypeName"
oldvalues = ['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 
             'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 
             'RackMountChassis', 'SubNotebook', 'BusExpansionChassis']
newvalues = [1,2,1,1,3,4,5,6,-1,7,8,9,10,11,12,-1,13,2,14,15,1,16]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Desktop' 'Notebook' 'MiniTower' 'Portable' 'Detachable' 'Laptop'
 'AllinOne' 'LowProfileDesktop' 'SpaceSaving' 'Other' 'Unknown' 'HandHeld'
 'UNKNOWN' 'Convertible' 'Tower' 'MainServerChassis' 'LunchBox'
 'SubNotebook' 'MiniPC' 'RackMountChassis' 'Tablet' 'BusExpansionChassis']
New values [ 2  1  4  7  3  8 10  6 -1  9  5 12 13 14 15 11 16]


In [32]:
# Census_PowerPlatformRoleName Index(['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']

colname = "Census_PowerPlatformRoleName"
full_features[colname] = full_features[colname].fillna('UNKNOWN')
oldvalues = ['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']
newvalues = [1,2,3,2,4,0,5,6,7,0]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Desktop' 'Mobile' 'Slate' 'Workstation' 'SOHOServer' 'UNKNOWN'
 'AppliancePC' 'EnterpriseServer' 'PerformanceServer' 'Unspecified']
New values [2 1 3 4 0 6 5 7]


In [33]:
# Census_OSArchitecture Index(['amd64', 'x86', 'arm64'], dtype='object')

colname = "Census_OSArchitecture"
oldvalues = ['amd64', 'x86', 'arm64']
newvalues = [1,3,2]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['amd64' 'arm64' 'x86']
New values [1 2 3]


In [34]:
# Census_OSBranch Index(['rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 'rs2_release', 'rs1_release', 'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'rs5_release', 'rs3_release_svc_escrow_im', 'rs_prerelease', 'rs_prerelease_flt', 'rs5_release_sigma', 'rs1_release_srvmedia', 'winblue_ltsb_escrow', 'win7sp1_ldr', 'winblue_ltsb', 'win8_gdr', 'rs_xbox', 'rs5_release_edge', 'rs5_release_sigma_dev', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'rs_shell', 'rs1_release_svc', 'win8_ldr', 'rs_onecore_base_cobalt', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs3_release_svc', 'Khmer OS'], dtype='object')

colname = "Census_OSBranch"
oldvalues = ['rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 'rs2_release', 'rs1_release', 'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'rs5_release', 'rs3_release_svc_escrow_im', 'rs_prerelease', 'rs_prerelease_flt', 'rs5_release_sigma']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['rs4_release' 'rs1_release' 'rs3_release_svc_escrow' 'th2_release'
 'rs3_release' 'th1_st1' 'rs2_release' 'th1' 'rs3_release_svc_escrow_im'
 'th2_release_sec' 'rs5_release' 'rs_prerelease_flt' 'rs_prerelease'
 'rs5_release_sigma']
New values [ 1  5  3  6  2  8  4  9 11  7 10 13 12 14]
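The `oldvalues` lists in these cells appear to follow the frequency order shown in the `Index([...])` comments, so `i+1` effectively acts as a frequency rank. A sketch under that assumption, building the same kind of mapping from `value_counts()` instead of typing the labels out. The toy data and the `top_n` cutoff are hypothetical.

```python
import pandas as pd

# Toy stand-in for a categorical column such as Census_OSBranch.
branch = pd.Series(
    ["rs4_release"] * 5 + ["rs3_release"] * 3 + ["th1"] * 2 + ["rs_xbox"]
)

top_n = 3  # hypothetical cutoff: keep only the most frequent labels

# value_counts() sorts by descending frequency, so position + 1 is the rank.
oldvalues = branch.value_counts().index[:top_n].tolist()
mapping = {label: rank for rank, label in enumerate(oldvalues, start=1)}

# Labels outside the top_n are left as NaN in this sketch.
encoded = branch.map(mapping)
print(mapping)
```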


In [35]:
# Census_OSSkuName Index(['CORE', 'PROFESSIONAL', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'EDUCATION', 'ENTERPRISE', 'PROFESSIONAL_N', 'ENTERPRISE_S', 'STANDARD_SERVER', 'CLOUD', 'CORE_N', 'STANDARD_EVALUATION_SERVER', 'EDUCATION_N', 'ENTERPRISE_S_N', 'DATACENTER_EVALUATION_SERVER', 'SB_SOLUTION_SERVER', 'ENTERPRISE_N', 'PRO_WORKSTATION', 'UNLICENSED', 'DATACENTER_SERVER', 'PRO_WORKSTATION_N', 'CLOUDN', 'PRO_CHINA', 'SERVERRDSH', 'ULTIMATE', 'PRO_FOR_EDUCATION', 'PRO_SINGLE_LANGUAGE', 'UNDEFINED', 'STARTER', 'ENTERPRISEG'], dtype='object')

colname = "Census_OSSkuName"
oldvalues = ['CORE', 'PROFESSIONAL', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'EDUCATION', 'ENTERPRISE', 'PROFESSIONAL_N', 'ENTERPRISE_S', 'STANDARD_SERVER', 'CLOUD', 'CORE_N', 'STANDARD_EVALUATION_SERVER', 'EDUCATION_N', 'ENTERPRISE_S_N', 'DATACENTER_EVALUATION_SERVER', 'SB_SOLUTION_SERVER', 'ENTERPRISE_N', 'PRO_WORKSTATION', 'UNLICENSED']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['PROFESSIONAL' 'CORE' 'CORE_SINGLELANGUAGE' 'ENTERPRISE_S'
 'CORE_COUNTRYSPECIFIC' 'ENTERPRISE_S_N' 'ENTERPRISE' 'EDUCATION' 'CLOUD'
 'PROFESSIONAL_N' 'STANDARD_SERVER' 'CORE_N' 'STANDARD_EVALUATION_SERVER'
 'EDUCATION_N' 'DATACENTER_EVALUATION_SERVER' 'SB_SOLUTION_SERVER'
 'ENTERPRISE_N' 'PRO_WORKSTATION' 'UNLICENSED']
New values [ 2  1  3  8  4 14  6  5 10  7  9 11 12 13 15 16 17 18 19]


In [36]:
# Census_OSInstallTypeName Index(['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh'], dtype='object')

colname = "Census_OSInstallTypeName"
oldvalues = ['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['UUPUpgrade' 'IBSClean' 'Update' 'Upgrade' 'Other' 'Clean' 'Reset'
 'Refresh' 'CleanPCRefresh']
New values [1 2 3 4 5 8 6 7 9]


In [37]:
# Census_OSWUAutoUpdateOptionsName Index(['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify'], dtype='object')

colname = "Census_OSWUAutoUpdateOptionsName"
oldvalues = ['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['UNKNOWN' 'FullAuto' 'Notify' 'AutoInstallAndRebootAtMaintenanceTime'
 'Off' 'DownloadNotify']
New values [2 1 3 4 5 6]


In [38]:
# Census_GenuineStateName Index(['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED'], dtype='object')

colname = "Census_GenuineStateName"
oldvalues = ['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['IS_GENUINE' 'OFFLINE' 'INVALID_LICENSE' 'UNKNOWN' 'TAMPERED']
New values [1 3 2 4 5]


In [39]:
# Census_ActivationChannel Index(['Retail', 'OEM:DM', 'Volume:GVLK', 'OEM:NONSLP', 'Volume:MAK', 'Retail:TB:Eval'], dtype='object')

colname = "Census_ActivationChannel"
oldvalues = ['Retail', 'OEM:DM', 'Volume:GVLK', 'OEM:NONSLP', 'Volume:MAK', 'Retail:TB:Eval']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Retail' 'OEM:NONSLP' 'OEM:DM' 'Volume:GVLK' 'Volume:MAK'
 'Retail:TB:Eval']
New values [1 4 2 3 5 6]


In [40]:
# Census_FlightRing Index(['Retail', 'NOT_SET', 'Unknown', 'WIS', 'WIF', 'RP', 'Disabled', 'OSG', 'Canary', 'Invalid', 'CBCanary'], dtype='object')

colname = "Census_FlightRing"
oldvalues = ['Retail', 'NOT_SET', 'Unknown', 'WIS', 'WIF', 'RP', 'Disabled', 'OSG', 'Canary', 'Invalid', 'CBCanary']
newvalues = [1,2,0,3,4,5,0,0,0,0,0]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Retail' 'NOT_SET' 'Unknown' 'Disabled' 'RP' 'WIS' 'WIF' 'OSG' 'Canary'
 'Invalid']
New values [1 2 0 5 3 4]


In [41]:
# PuaMode Index(['off', 'on', 'audit'], dtype='object')

#colname = "PuaMode"
#oldvalues = ['off', 'on', 'audit']
#newvalues = [0,1,2]

#df_replacevalues(full_features, colname, oldvalues, newvalues)

In [42]:
# Census_OSEdition

colname = "Census_OSEdition"
oldvalues = ['Core','Professional','CoreSingleLanguage','CoreCountrySpecific','ProfessionalEducation','Education',
             'Enterprise','ProfessionalN','EnterpriseS','ServerStandard','Cloud','CoreN']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Professional' 'Core' 'CoreSingleLanguage' 'EnterpriseS'
 'CoreCountrySpecific' 'ProfessionalEducation' 'Enterprise' 'Education'
 'Cloud' 'ProfessionalN' 'ServerStandard' 'CoreN']
New values [ 2  1  3  9  4  5  7  6 11  8 10 12]


In [43]:
# Now let us check the string columns again

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsBuildLab',
 'IsProtected',
 'IeVerIdentifier',
 'Firewall',
 'UacLuaenable',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_InternalBatteryNumberOfCharges',
 'Census_OSVersion',
 'Census_OSBuildNumber',
 'Census_OSBuildRevision',
 'Census_OSInstallLanguageIdentifier',
 'Census_OSUILocaleIdentifier',
 'Census_IsFlightsDisabled',
 'Census_ThresholdOptIn',
 'Census_FirmwareManufacturerIdentifier',
 'Census_FirmwareVersionIdentifier',
 'Census_IsWIM
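The loop above collects the columns whose dtype is `object`. A compact sketch of the same check using `DataFrame.select_dtypes`, shown on a hypothetical toy frame rather than on `full_features`.

```python
import pandas as pd

# Toy frame: one numeric column and two string columns stored as object dtype.
toy = pd.DataFrame({
    "OsBuild": [17134, 14393],
    "OsVer": ["10.0.0.0", "10.0.0.0"],
    "EngineVersion": ["1.1.15100.1", "1.1.14600.4"],
}).astype({"OsVer": object, "EngineVersion": object})

# select_dtypes keeps only the columns with the requested dtype.
string_columns = toy.select_dtypes(include="object").columns.tolist()
print(string_columns)
```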

In [44]:
full_features[string_columns].head(10)

MachineIdentifier,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,OsVer,OsBuild,OsSuite,OsBuildLab,IsProtected,IeVerIdentifier,Firewall,UacLuaenable,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_IsFlightsDisabled,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable
0000028988387b115f69f31a3bf04f09,1.1.15100.1,4.18.1807.18075,1.273.1735.0,7,53447,1,1,29,35,171,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,2668,9124,4,5,2341,18.9,1440,900,4294967295,10.0.17134.165,17134,165,26,119,0,,628,36144,,0,0
000007535c3f730efa9ea0b7ef1bd645,1.1.14600.4,4.13.17134.1,1.263.48.0,7,53447,1,1,93,119,64,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,2668,91656,4,5,2405,13.9,1366,768,1,10.0.17134.1,17134,1,8,31,0,,628,57858,,0,0
000007905a28d863f6d0d597892cd692,1.1.15100.1,4.18.1807.18075,1.273.1341.0,7,53447,1,1,86,64,49,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,4909,317701,4,5,1972,21.5,1920,1080,4294967295,10.0.17134.165,17134,165,7,30,0,,142,52682,,0,0
00000b11598a75ea8ba1beea8459149f,1.1.15100.1,4.18.1807.18075,1.273.1527.0,7,53447,1,1,88,117,115,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,1443,275890,4,5,2273,18.5,1366,768,4294967295,10.0.17134.228,17134,228,17,64,0,,355,20050,,0,0
000014a5f00daa18e76b81417eeb99fc,1.1.15100.1,4.18.1807.18075,1.273.1379.0,7,53447,1,1,18,277,75,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,1443,331929,4,5,2500,14.0,1366,768,0,10.0.17134.191,17134,191,8,31,0,0.0,355,19844,0.0,0,0
000016191b897145d069102325cab760,1.1.15100.1,4.18.1807.18075,1.273.1094.0,7,53447,1,1,97,126,124,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,3799,340727,2,5,4324,21.5,1920,1080,4294967295,10.0.17134.165,17134,165,18,72,0,0.0,93,51039,0.0,0,0
0000161e8abf8d8b89c5ab8787fd712b,1.1.15100.1,4.18.1807.18075,1.273.845.0,7,43927,2,1,78,89,88,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,3799,207404,2,1,657,17.2,1600,900,0,10.0.17134.165,17134,165,14,49,0,,556,63175,,0,0
000019515bc8f95851aff6de873405e8,1.1.15100.1,4.18.1807.18075,1.273.1393.0,7,53447,1,1,97,126,124,10.0.0.0,14393,768,14393.0.amd64fre.rs1_release.160715-1616,1,94.0,1,1,5682,338896,2,5,3381,15.5,1366,768,0,10.0.14393.0,14393,0,18,72,0,0.0,512,63122,0.0,0,0
00001a027a0ab970c408182df8484fce,1.1.15200.1,4.18.1807.18075,1.275.988.0,7,53447,1,1,164,205,172,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137.0,1,1,2206,240688,4,5,2836,15.6,1920,1080,0,10.0.17134.254,17134,254,27,120,0,0.0,500,15510,0.0,0,0
00001a18d69bb60bda9779408dcf02ac,1.1.15100.1,4.18.1807.18075,1.273.973.0,7,46413,2,1,93,119,64,10.0.0.0,16299,768,16299.431.amd64fre.rs3_release_svc_escrow.1805...,1,,1,1,585,189457,4,5,2373,15.5,1366,768,0,10.0.16299.431,16299,431,8,31,0,0.0,556,63555,0.0,0,0


In [45]:
# Now we need to process the columns that contain version numbers.
# Each one is split on '.' and '-' into separate columns
# (four parts for most of them, six for OsBuildLab).

versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']
newcolumnnames = []

for colname in versions:
    data = full_features[colname].str.split(r"\.|-",expand=True) # Split if '.' or '-'
    for i in range(data.shape[1]):
        newcolumnname = "%s_%d" % (colname, i+1)
        newcolumnnames.append(newcolumnname)
        full_features[newcolumnname] = data[i]
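To make the split concrete, here is a small sketch on two `OsBuildLab` strings taken from the preview above. The toy series and the `parts` name are only for illustration; the real cell writes the pieces back into `full_features`.

```python
import pandas as pd

# Toy stand-in for full_features["OsBuildLab"].
osbuildlab = pd.Series([
    "17134.1.amd64fre.rs4_release.180410-1804",
    "14393.0.amd64fre.rs1_release.160715-1616",
])

# Split on '.' or '-'; expand=True returns one column per piece.
parts = osbuildlab.str.split(r"\.|-", expand=True)
parts.columns = [f"OsBuildLab_{i + 1}" for i in range(parts.shape[1])]
print(parts)
```

Every piece is still a string at this point, including the ones that look numeric.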

In [46]:
full_features[newcolumnnames].head(10)

MachineIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
0000028988387b115f69f31a3bf04f09,1,1,15100,1,4,18,1807,18075,1,273,1735,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,165
000007535c3f730efa9ea0b7ef1bd645,1,1,14600,4,4,13,17134,1,1,263,48,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,1
000007905a28d863f6d0d597892cd692,1,1,15100,1,4,18,1807,18075,1,273,1341,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,165
00000b11598a75ea8ba1beea8459149f,1,1,15100,1,4,18,1807,18075,1,273,1527,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
000014a5f00daa18e76b81417eeb99fc,1,1,15100,1,4,18,1807,18075,1,273,1379,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,191
000016191b897145d069102325cab760,1,1,15100,1,4,18,1807,18075,1,273,1094,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,165
0000161e8abf8d8b89c5ab8787fd712b,1,1,15100,1,4,18,1807,18075,1,273,845,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,165
000019515bc8f95851aff6de873405e8,1,1,15100,1,4,18,1807,18075,1,273,1393,0,10,0,0,0,14393,0,amd64fre,rs1_release,160715,1616,10,0,14393,0
00001a027a0ab970c408182df8484fce,1,1,15200,1,4,18,1807,18075,1,275,988,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,254
00001a18d69bb60bda9779408dcf02ac,1,1,15100,1,4,18,1807,18075,1,273,973,0,10,0,0,0,16299,431,amd64fre,rs3_release_svc_escrow,180502,1908,10,0,16299,431


In [47]:
#colname = "OsBuildLab_4"
#print (full_features[colname].value_counts())
#print (colname, full_features[colname].value_counts().keys())

In [48]:
# After splitting the columns, the only parts we need to remap are OsBuildLab_3 and OsBuildLab_4
# The other parts are digit strings and are converted to numbers later

# OsBuildLab_3 Index(['amd64fre', 'x86fre', 'arm64fre'], dtype='object')

colname = "OsBuildLab_3"
oldvalues = ['amd64fre', 'x86fre', 'arm64fre']
newvalues = [1,3,2]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['amd64fre' 'arm64fre' 'x86fre']
New values [1 2 3]


In [49]:
# OsBuildLab_4 Index(['rs4_release', 'rs3_release_svc_escrow', 'rs3_release', 'rs2_release', 'rs1_release', 'th2_release_sec', 'th1', 'winblue_ltsb_escrow', 'th2_release', 'rs1_release_inmarket', 'winblue_ltsb', 'win7sp1_ldr', 'rs3_release_svc', 'rs1_release_1', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'th1_st1', 'rs5_release', 'rs1_release_inmarket_aim', 'rs3_release_svc_escrow_im', 'th2_release_inmarket', 'rs_prerelease', 'rs_prerelease_flt', 'win7sp1_gdr', 'winblue_gdr', 'th1_escrow', 'win7_gdr', 'winblue_r4', 'rs1_release_inmarket_rim', 'rs1_release_d', 'winblue_r9', 'winblue_r5', 'win7_rtm', 'win7sp1_rtm', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'rs5_release_sigma', 'win7_ldr', 'rs5_release_sigma_dev', 'rs_xbox', 'rs5_release_edge', 'winblue_rtm', 'win7sp1_rc', 'rs3_release_svc_sec', 'rs_onecore_base_cobalt', 'rs6_prerelease', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs_shell']

colname = "OsBuildLab_4"
oldvalues = ['rs4_release', 'rs3_release_svc_escrow', 'rs3_release', 'rs2_release', 'rs1_release', 'th2_release_sec', 'th1', 'winblue_ltsb_escrow', 'th2_release', 'rs1_release_inmarket', 'winblue_ltsb', 'win7sp1_ldr', 'rs3_release_svc', 'rs1_release_1', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'th1_st1', 'rs5_release', 'rs1_release_inmarket_aim', 'rs3_release_svc_escrow_im', 'th2_release_inmarket', 'rs_prerelease', 'rs_prerelease_flt', 'win7sp1_gdr', 'winblue_gdr', 'th1_escrow', 'win7_gdr', 'winblue_r4', 'rs1_release_inmarket_rim', 'rs1_release_d', 'winblue_r9', 'winblue_r5', 'win7_rtm', 'win7sp1_rtm', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'rs5_release_sigma', 'win7_ldr', 'rs5_release_sigma_dev', 'rs_xbox', 'rs5_release_edge', 'winblue_rtm', 'win7sp1_rc', 'rs3_release_svc_sec', 'rs_onecore_base_cobalt', 'rs6_prerelease', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs_shell']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['rs4_release' 'rs1_release' 'rs3_release_svc_escrow' 'win7sp1_gdr'
 'rs3_release' 'winblue_ltsb_escrow' 'th1' 'rs1_release_inmarket'
 'rs2_release' 'th2_release' 'winblue_ltsb' 'th2_release_sec'
 'rs3_release_svc_escrow_im' 'th1_st1' 'th2_release_inmarket'
 'rs1_release_sec' 'rs1_release_1' 'win7sp1_ldr' 'win7sp1_ldr_escrow'
 'rs3_release_svc' 'rs1_release_inmarket_aim' 'rs5_release'
 'rs_prerelease_flt' 'rs1_release_inmarket_rim' 'winblue_r7'
 'rs_prerelease' 'win7_gdr' 'rs5_release_sigma' 'winblue_gdr' 'winblue_r4'
 'win7sp1_rtm' 'rs1_release_d' 'th1_escrow' 'rs5_release_edge'
 'winblue_r8' 'win7_ldr' 'winblue_r5' 'win7_rtm' 'winblue_r9' 'winblue_r3'
 'rs3_release_svc_sec' 'rs5_release_sigma_dev' 'win7sp1_rc' 'rs_shell'
 'rs_onecore_stack_per1' 'rs_onecore_sigma_grfx_dev' 'winblue_rtm'
 'rs_onecore_base_cobalt' 'rs_xbox' 'rs6_prerelease' 'rs5_release_sign']
New values [ 1  5  2 24  3  8  7 10  4  9 11  6 20 17 21 16 14 12 15 13 19 18 23 29
 35 22 27 38 25 28 34 30 26

In [50]:
versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']

full_features = full_features.drop(versions, axis=1)

In [51]:
for colname in full_features.columns:
    if full_features[colname].dtypes.name not in ["int8","int16","int32"]:
        # Coerce anything non-numeric to NaN, then fill every gap with the most frequent value
        full_features[colname] = pd.to_numeric(full_features[colname], errors='coerce')
        topvalue = full_features[colname].value_counts().idxmax()
        full_features[colname] = full_features[colname].fillna(topvalue)
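The cell above coerces each remaining column to numbers and fills missing entries with the column's most frequent value. A minimal sketch of that logic on a toy column, assuming a mix of numeric strings, one unmapped label and one missing value; the names `raw`, `numeric` and `filled` are hypothetical.

```python
import pandas as pd

# Toy column mixing numeric strings, an unmapped label, and a missing value.
raw = pd.Series(["137.0", "137.0", "94.0", "Blade", None])

# Non-numeric entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# value_counts() ignores NaN, so idxmax() is the most frequent valid value.
topvalue = numeric.value_counts().idxmax()
filled = numeric.fillna(topvalue)
print(filled.tolist())
```

The sketch assigns the result of `fillna` to a new name, which avoids relying on in-place modification of a column selection. Note that any label left unmapped earlier ends up indistinguishable from the most common value after this step.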

In [52]:
full_features.head(10)

MachineIdentifier,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
0000028988387b115f69f31a3bf04f09,1,0,7.0,0,53447.0,1.0,1.0,1,29,128035,18,35.0,171,10,1,17134,256,504,55,1.0,0,0,137.0,4,1.0,1.0,1,1,2668.0,9124.0,4.0,5.0,2341.0,476940.0,1,299451.0,0,4096.0,2,18.9,1440.0,900.0,2,4294967000.0,1,1,17134,165,2,2,1,26.0,119,2,0,1,1,0.0,1,0.0,628.0,36144.0,0,0.0,0.0,0,0,0.0,0,10,0,1,1,15100,1,4,18,1807,18075,1,273.0,1735,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,165
000007535c3f730efa9ea0b7ef1bd645,1,0,7.0,0,53447.0,1.0,1.0,1,93,1482,18,119.0,64,10,1,17134,256,504,55,1.0,0,0,137.0,4,1.0,1.0,2,1,2668.0,91656.0,4.0,5.0,2405.0,476940.0,1,102385.0,0,4096.0,1,13.9,1366.0,768.0,1,1.0,1,1,17134,1,2,2,2,8.0,31,2,0,3,1,0.0,2,0.0,628.0,57858.0,0,0.0,0.0,0,0,0.0,0,8,0,1,1,14600,4,4,13,17134,1,1,263.0,48,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,1
000007905a28d863f6d0d597892cd692,1,0,7.0,0,53447.0,1.0,1.0,1,86,153579,18,64.0,49,10,1,17134,768,504,52,1.0,0,0,137.0,6,1.0,1.0,1,1,4909.0,317701.0,4.0,5.0,1972.0,114473.0,2,113907.0,0,4096.0,2,21.5,1920.0,1080.0,2,4294967000.0,1,1,17134,165,1,1,1,7.0,30,1,0,1,4,0.0,1,0.0,142.0,52682.0,0,0.0,0.0,0,0,0.0,0,3,0,1,1,15100,1,4,18,1807,18075,1,273.0,1341,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,165
00000b11598a75ea8ba1beea8459149f,1,0,7.0,0,53447.0,1.0,1.0,1,88,20710,0,117.0,115,10,1,17134,256,504,55,1.0,0,0,137.0,4,1.0,1.0,1,1,1443.0,275890.0,4.0,5.0,2273.0,238475.0,3,227116.0,0,4096.0,4,18.5,1366.0,768.0,2,4294967000.0,1,1,17134,228,2,2,1,17.0,64,1,0,1,4,0.0,1,0.0,355.0,20050.0,0,0.0,0.0,0,0,0.0,0,3,1,1,1,15100,1,4,18,1807,18075,1,273.0,1527,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,228
000014a5f00daa18e76b81417eeb99fc,1,0,7.0,0,53447.0,1.0,1.0,1,18,37376,0,277.0,75,10,1,17134,768,504,52,1.0,0,0,137.0,6,1.0,1.0,2,1,1443.0,331929.0,4.0,5.0,2500.0,476940.0,1,101900.0,0,6144.0,1,14.0,1366.0,768.0,1,0.0,1,1,17134,191,1,1,3,8.0,31,1,0,1,1,0.0,1,0.0,355.0,19844.0,0,0.0,0.0,0,0,0.0,0,1,1,1,1,15100,1,4,18,1807,18075,1,273.0,1379,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,191
000016191b897145d069102325cab760,1,0,7.0,0,53447.0,1.0,1.0,1,97,13598,27,126.0,124,10,1,17134,256,504,55,1.0,0,0,137.0,6,1.0,1.0,1,1,3799.0,340727.0,2.0,5.0,4324.0,114473.0,2,113671.0,0,8192.0,2,21.5,1920.0,1080.0,2,4294967000.0,1,1,17134,165,2,2,1,18.0,72,1,0,1,1,0.0,1,0.0,93.0,51039.0,0,0.0,0.0,0,0,0.0,0,15,1,1,1,15100,1,4,18,1807,18075,1,273.0,1094,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,165
0000161e8abf8d8b89c5ab8787fd712b,1,0,7.0,0,43927.0,2.0,1.0,1,78,81215,0,89.0,88,10,1,17134,768,504,52,1.0,0,0,137.0,4,1.0,1.0,2,1,3799.0,207404.0,2.0,1.0,657.0,476940.0,1,458702.0,0,4096.0,1,17.2,1600.0,900.0,1,0.0,1,1,17134,165,1,1,2,14.0,49,1,0,1,1,0.0,1,0.0,556.0,63175.0,1,0.0,0.0,0,0,0.0,0,10,1,1,1,15100,1,4,18,1807,18075,1,273.0,845,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,165
000019515bc8f95851aff6de873405e8,1,0,7.0,0,53447.0,1.0,1.0,1,97,150323,27,126.0,124,10,1,14393,768,501,52,1.0,0,0,94.0,6,1.0,1.0,2,1,5682.0,338896.0,2.0,5.0,3381.0,305245.0,1,290807.0,1,4096.0,1,15.5,1366.0,768.0,1,0.0,1,5,14393,0,1,1,4,18.0,72,1,0,1,1,0.0,1,0.0,512.0,63122.0,0,0.0,0.0,0,0,0.0,0,15,0,1,1,15100,1,4,18,1807,18075,1,273.0,1393,0,10,0,0,0,14393.0,0.0,1,5,160715.0,1616.0,10,0,14393,0
00001a027a0ab970c408182df8484fce,1,0,7.0,0,53447.0,1.0,1.0,1,164,155006,27,205.0,172,10,1,17134,256,504,55,1.0,0,0,137.0,6,1.0,1.0,2,1,2206.0,240688.0,4.0,5.0,2836.0,305245.0,1,303892.0,0,4096.0,1,15.6,1920.0,1080.0,1,0.0,1,1,17134,254,2,2,3,27.0,120,1,0,1,1,0.0,1,0.0,500.0,15510.0,0,0.0,0.0,0,0,0.0,0,15,0,1,1,15200,1,4,18,1807,18075,1,275.0,988,0,10,0,0,0,17134.0,1.0,1,1,180410.0,1804.0,10,0,17134,254
00001a18d69bb60bda9779408dcf02ac,1,0,7.0,0,46413.0,2.0,1.0,1,93,98572,27,119.0,64,10,1,16299,768,503,52,1.0,0,0,137.0,6,1.0,1.0,2,1,585.0,189457.0,4.0,5.0,2373.0,953869.0,1,203252.0,1,8192.0,1,15.5,1366.0,768.0,1,0.0,1,3,16299,431,3,3,4,8.0,31,2,0,1,2,0.0,1,0.0,556.0,63555.0,1,0.0,0.0,0,0,0.0,1,8,1,1,1,15100,1,4,18,1807,18075,1,273.0,973,0,10,0,0,0,16299.0,431.0,1,2,180502.0,1908.0,10,0,16299,431


In [53]:
# Let's see some details of the loaded data
full_features.describe()

Statistic,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
count,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0,8921483.0
mean,1.010664,7.509962e-06,6.845893,0.01733378,47862.77,1.325452,1.020882,0.9879711,108.049,78302.35,17.19621,169.6641,122.8161,13.15615,1.182901,15719.97,575.1534,480.0748,52.63172,0.9458434,2.891896e-05,0.0004350174,126.6448,4.851504,0.9788018,13.01312,2.199818,1.00162,2224.959,239995.5,3.989743,4.533721,2372.781,3073530.0,1.418861,375295.9,0.07718728,6097.033,1.548219,16.66998,1546.759,896.8883,1.370257,1089928000.0,1.1828,2.642073,15834.83,973.049,1.978409,1.952274,2.94429,14.56586,60.46534,1.883548,0.0005452008,1.145706,1.596226,9.863831e-06,1.014833,9.146461e-05,397.5212,33029.31,0.4860229,1.12089e-07,0.007026859,0.1255431,0.03807091,0.05696004,0.273933,7.615417,0.4997927,1.0,1.0,15074.67,1.297016,4.0,15.86526,5645.272,14097.16,0.9999924,272.3574,934.7381,0.0,9.870695,0.07593054,0.01082477,0.0006018058,15719.97,1421.471,1.182901,3.077122,176502.5,1777.148,9.999991,5.156093e-06,15834.83,973.0486
std,0.1030851,0.002740421,1.024237,0.1305118,14008.39,0.5222781,0.1672192,0.1090149,63.04706,50381.52,12.39369,89.31861,69.32125,80.45037,0.5764641,2190.685,248.0847,80.25445,5.872511,0.2263264,0.005377558,0.02085253,42.54505,1.271022,0.1440444,9861.775,1.319161,0.04026843,1309.464,71971.89,2.077727,1.284258,838.9427,4438388000.0,0.6211084,326013.6,0.2668884,5096.258,1.485853,5.877963,367.6356,214.2633,0.6265292,1869027000.0,0.576321,2.025068,1961.743,2931.971,1.148644,1.090706,1.817188,10.1801,44.99992,0.9366091,0.02334317,0.4300829,0.7579678,0.003140658,0.3043268,0.009563277,222.4636,21015.73,0.4998046,0.0003347969,0.08353133,0.3313338,0.1913675,0.2317663,0.4459751,4.694833,0.5,0.0,0.0,276.7845,1.022552,0.0,3.323923,6255.472,7375.821,0.002760796,6.047073,532.7531,0.0,0.7074591,0.4483061,0.1453866,0.2306252,2190.676,4610.761,0.5764634,3.127958,6042.845,210.9401,0.005989023,0.003667513,1961.743,2931.971
min,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,7.0,1.0,7600.0,16.0,201.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,0.0,0.0,255.0,-1.0,0.7,-1.0,-1.0,0.0,0.0,1.0,1.0,7600.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,9700.0,0.0,4.0,4.0,203.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,7600.0,0.0,1.0,1.0,90713.0,13.0,6.0,0.0,7600.0,0.0
25%,1.0,0.0,7.0,0.0,49480.0,1.0,1.0,1.0,51.0,31276.0,0.0,89.0,74.0,10.0,1.0,15063.0,256.0,503.0,52.0,1.0,0.0,0.0,111.0,4.0,1.0,1.0,2.0,1.0,1443.0,189819.0,2.0,5.0,1998.0,244196.0,1.0,120488.0,0.0,4096.0,1.0,13.9,1366.0,768.0,1.0,0.0,1.0,1.0,15063.0,167.0,1.0,1.0,1.0,8.0,31.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,142.0,13299.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,1.0,1.0,15100.0,1.0,4.0,13.0,1807.0,17443.0,1.0,273.0,497.0,0.0,10.0,0.0,0.0,0.0,15063.0,1.0,1.0,1.0,170928.0,1804.0,10.0,0.0,15063.0,167.0
50%,1.0,0.0,7.0,0.0,53447.0,1.0,1.0,1.0,97.0,77866.0,18.0,181.0,88.0,10.0,1.0,16299.0,768.0,503.0,52.0,1.0,0.0,0.0,135.0,4.0,1.0,1.0,2.0,1.0,2102.0,248045.0,4.0,5.0,2503.0,476940.0,1.0,248977.0,0.0,4096.0,1.0,15.5,1366.0,768.0,1.0,0.0,1.0,2.0,16299.0,285.0,2.0,2.0,3.0,9.0,34.0,2.0,0.0,1.0,1.0,0.0,1.0,0.0,486.0,33075.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,1.0,1.0,15100.0,1.0,4.0,18.0,1807.0,18075.0,1.0,273.0,948.0,0.0,10.0,0.0,0.0,0.0,16299.0,1.0,1.0,2.0,180410.0,1804.0,10.0,0.0,16299.0,285.0
75%,1.0,0.0,7.0,0.0,53447.0,2.0,1.0,1.0,162.0,121270.0,27.0,267.0,182.0,10.0,1.0,17134.0,768.0,504.0,55.0,1.0,0.0,0.0,137.0,6.0,1.0,1.0,2.0,1.0,2668.0,308607.0,4.0,5.0,2868.0,953869.0,2.0,475965.0,0.0,8192.0,2.0,17.2,1920.0,1080.0,2.0,4294967000.0,1.0,4.0,17134.0,547.0,3.0,3.0,4.0,20.0,90.0,3.0,0.0,1.0,2.0,0.0,1.0,0.0,556.0,52250.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,1.0,1.0,1.0,15200.0,1.0,4.0,18.0,10586.0,18075.0,1.0,275.0,1379.0,0.0,10.0,0.0,0.0,0.0,17134.0,431.0,1.0,4.0,180410.0,1834.0,10.0,0.0,17134.0,547.0
max,6.0,1.0,35.0,1.0,70507.0,7.0,5.0,1.0,222.0,167962.0,52.0,296.0,283.0,2016.0,3.0,18244.0,784.0,508.0,90.0,1.0,1.0,1.0,429.0,6.0,1.0,16777220.0,13.0,3.0,6145.0,345498.0,192.0,10.0,4479.0,8160437000000.0,3.0,47687100.0,1.0,1572864.0,16.0,182.3,12288.0,8640.0,7.0,4294967000.0,3.0,14.0,18244.0,41736.0,12.0,19.0,9.0,39.0,162.0,6.0,1.0,5.0,6.0,1.0,5.0,1.0,1092.0,72105.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,15.0,1.0,1.0,1.0,15300.0,6.0,4.0,18.0,17686.0,20082.0,1.0,277.0,7751.0,0.0,10.0,3.0,153.0,153.0,18244.0,24236.0,3.0,51.0,180918.0,2340.0,10.0,3.0,18244.0,41736.0


In [54]:
full_features['UacLuaenable'].unique()

array([1.0000000e+00, 0.0000000e+00, 4.8000000e+01, 3.0000000e+00,
       2.0000000e+00, 6.3570620e+06, 4.9000000e+01, 1.6777216e+07,
       5.0000000e+00, 2.5500000e+02, 7.7988840e+06])

In [55]:
#[] (180000, 97) (20000, 97) (180000,) (20000,) AdaBoostClassifier 61.675000000000004
#['AvSigVersion_1', 'AvSigVersion_2', 'AvSigVersion_3', 'AvSigVersion_4'] (180000, 93) (20000, 93) (180000,) (20000,) AdaBoostClassifier 61.46
#['Census_InternalPrimaryDiagonalDisplaySizeInInches'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.695
#['Census_OSEdition'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.4
#['Census_PrimaryDiskTotalCapacity'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.58
#['Census_SystemVolumeTotalCapacity'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.77
#['Census_TotalPhysicalRAM'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.150000000000006
#['IsBeta'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.0
#['AutoSampleOptIn'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.9
#['LocaleEnglishNameIdentifier'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.61
#['Census_IsFlightsDisabled'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.53999999999999
#['Census_FirmwareManufacturerIdentifier'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.08
#['Census_FirmwareVersionIdentifier'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.370000000000005
#['Census_IsVirtualDevice'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.605
#['Census_IsAlwaysOnAlwaysConnectedCapable'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.745000000000005
#['Census_ThresholdOptIn'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.32
#['Census_IsWIMBootEnabled'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.775000000000006
#['Census_InternalBatteryNumberOfCharges'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.35
#['Census_OSSkuName'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.535
#['Census_ChassisTypeName'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.56
#['Census_OSBranch'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.5
#['Census_OSBuildNumber'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.845000000000006
#['Census_OSBuildRevision'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.529999999999994
#['Census_OSArchitecture'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.515
#['OsBuild'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.28
#['ProductName'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.660000000000004

In [56]:
# ProductName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.843
# IsBeta (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.82299999999999
# RtpStateBitfield (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.1
# IsSxsPassiveMode (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.007999999999996
# AVProductStatesIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.812
# AVProductsInstalled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.739999999999995
# AVProductsEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.726000000000006
# HasTpm (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.788
# CountryIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.824999999999996
# CityIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.033
# OrganizationIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.937
# GeoNameIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.17
# LocaleEnglishNameIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.695
# Platform (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.032
# Processor (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.975
# OsBuild (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.855000000000004
# OsSuite (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.788
# OsPlatformSubRelease (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.868
# SkuEdition (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.836999999999996
# IsProtected (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.912
# AutoSampleOptIn (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.805
# PuaMode (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.992000000000004
# SMode (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.044
# IeVerIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.91
# SmartScreen (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.769
# Firewall (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.061
# UacLuaenable (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.867000000000004
# Census_MDC2FormFactor (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.064
# Census_DeviceFamily (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.235
# Census_OEMNameIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.188
# Census_OEMModelIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.931999999999995
# Census_ProcessorCoreCount (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.60099999999999
# Census_ProcessorManufacturerIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.711000000000006
# Census_ProcessorModelIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.089000000000006
# Census_PrimaryDiskTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.877
# Census_PrimaryDiskTypeName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.064
# Census_SystemVolumeTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.244
# Census_HasOpticalDiskDrive (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.875
# Census_TotalPhysicalRAM (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.589000000000006
# Census_ChassisTypeName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.017999999999994
# Census_InternalPrimaryDiagonalDisplaySizeInInches (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.071
# Census_InternalPrimaryDisplayResolutionHorizontal (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.791
# Census_InternalPrimaryDisplayResolutionVertical (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.002
# Census_PowerPlatformRoleName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.772000000000006
# Census_InternalBatteryNumberOfCharges (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.058
# Census_OSArchitecture (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.68
# Census_OSBranch (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.763
# Census_OSBuildNumber (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.09400000000001
# Census_OSBuildRevision (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.088
# Census_OSEdition (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.94
# Census_OSSkuName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.002
# Census_OSInstallTypeName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.624
# Census_OSInstallLanguageIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.745000000000005
# Census_OSUILocaleIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.869
# Census_OSWUAutoUpdateOptionsName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.265
# Census_IsPortableOperatingSystem (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.085
# Census_GenuineStateName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.782000000000004
# Census_ActivationChannel (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.992000000000004
# Census_IsFlightsDisabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.19
# Census_FlightRing (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.839
# Census_ThresholdOptIn (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.91799999999999
# Census_FirmwareManufacturerIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.086
# Census_FirmwareVersionIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.980000000000004
# Census_IsSecureBootEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.899
# Census_IsWIMBootEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.181
# Census_IsVirtualDevice (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.017
# Census_IsTouchEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.849
# Census_IsPenCapable (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.948
# Census_IsAlwaysOnAlwaysConnectedCapable (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.012
# Wdft_IsGamer (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.692
# Wdft_RegionIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.754
# EngineVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.983999999999995
# EngineVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.927
# EngineVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.714999999999996
# EngineVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.184
# AppVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.894000000000005
# AppVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.035
# AppVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.913
# AppVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.929
# AvSigVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.687000000000005
# AvSigVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.013
# AvSigVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.944
# AvSigVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.809000000000005
# OsVer_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.743
# OsVer_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.92100000000001
# OsVer_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.056999999999995
# OsVer_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.97899999999999
# OsBuildLab_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.166999999999994
# OsBuildLab_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.893
# OsBuildLab_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.076
# OsBuildLab_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.063
# OsBuildLab_5 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.016000000000005
# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288
# Census_OSVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.991
# Census_OSVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.041000000000004
# Census_OSVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.946999999999996
# Census_OSVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.86000000000001

In [57]:
# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288
# Census_OSWUAutoUpdateOptionsName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.265
# Census_SystemVolumeTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.244
# Census_DeviceFamily (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.235
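
The four lines in In [57] are the highest-scoring runs from the log in In [56]. Below is a minimal sketch of how that ranking could be automated, assuming the log lines keep the `# <column> <shapes> AdaBoostClassifier <score>` layout shown above. `parse_run` and `top_runs` are hypothetical helper names, not part of the notebook.

```python
import re

# Matches lines such as:
# "# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288"
LOG_LINE = re.compile(
    r"^#\s*(?P<column>\S+)\s+\(.*\)\s+(?P<model>\w+)\s+(?P<score>\d+(?:\.\d+)?)\s*$"
)


def parse_run(line):
    """Return (column, score) for one logged run, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("column"), float(match.group("score"))


def top_runs(lines, n=4):
    """Rank logged runs by accuracy, highest first, and keep the top n."""
    runs = [run for run in map(parse_run, lines) if run is not None]
    return sorted(runs, key=lambda run: run[1], reverse=True)[:n]


# A few lines copied from the In [56] log.
sample_log = [
    "# ProductName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.843",
    "# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288",
    "# Census_DeviceFamily (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.235",
    "# Wdft_IsGamer (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.692",
]

for column, score in top_runs(sample_log, n=2):
    print(column, score)
```

The regular expression only handles one column name per line, so the multi-column entry at the top of the In [55] log would be skipped rather than parsed.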

In [58]:
full_features.to_csv('./csv/train_v6.csv')

In [529]:
# Use the first 250,000 rows for training and the remaining rows for evaluation.
train_count = 250000 #int(len(full_features) * 0.8)

train_features = full_features.values[:train_count]
test_features  = full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

# Baseline: gradient boosting on all columns. score() returns mean accuracy.
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
all_columns_score = clf.score(normalized_test_features, test_labels)

print("All columns (normalized)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)


All columns (normalized) (250000, 96) (250000, 96) (250000,) (250000,) HistGradientBoostingClassifier 64.1392


In [530]:
# Project the scaled features onto 80 principal components.
# The PCA is fitted on the training rows only and reused for the test rows.
model = PCA(n_components=80)
pca_train_results = model.fit_transform(normalized_train_features)
pca_test_results = model.transform(normalized_test_features)

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(pca_train_results, train_labels)
pca_all_columns_score = clf.score(pca_test_results, test_labels)

# Note: the feature shapes printed here are the pre-PCA matrices (96 columns);
# the classifier itself is trained on the 80 PCA components.
print("All columns (PCA)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", pca_all_columns_score*100)


All columns (PCA) (250000, 96) (250000, 96) (250000,) (250000,) HistGradientBoostingClassifier 61.8156


In [563]:
columns_to_drop = [
    #'Census_OSArchitecture',
    'GeoNameIdentifier',
    'UacLuaenable',
    'Census_FirmwareVersionIdentifier',
    'IsProtected',
    #'OsSuite',
    'CityIdentifier',
    'Census_OEMModelIdentifier',
    
    # SAME
    'Census_ThresholdOptIn',
    'AutoSampleOptIn',
    'Census_IsFlightsDisabled',
    'IsBeta',
    'ProductName',
    'Census_IsWIMBootEnabled',
    'Census_DeviceFamily',
    'Census_OSBuildNumber',
    'Census_OSBuildRevision',
    'Platform',
    'Processor',
    'Census_IsPortableOperatingSystem',
    'Census_IsPenCapable',
    'OsBuild',
    'Census_ProcessorManufacturerIdentifier',
    'OsVer_1',
    'OsVer_2',
    'OsVer_3',
    'OsVer_4'
]

# Remove the listed columns, then rebuild the same train/test split as before.
df_full_features = full_features.drop(columns_to_drop, axis=1)

train_features = df_full_features.values[:train_count]
test_features  = df_full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]

scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

# Note: this overwrites all_columns_score from In [529]. From here on it holds
# the accuracy of the reduced feature set and serves as the baseline for the
# drop-one-column loops that follow.
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
all_columns_score = clf.score(normalized_test_features, test_labels)

print(columns_to_drop)
print(train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)

['GeoNameIdentifier', 'UacLuaenable', 'Census_FirmwareVersionIdentifier', 'IsProtected', 'CityIdentifier', 'Census_OEMModelIdentifier', 'Census_ThresholdOptIn', 'AutoSampleOptIn', 'Census_IsFlightsDisabled', 'IsBeta', 'ProductName', 'Census_IsWIMBootEnabled', 'Census_DeviceFamily', 'Census_OSBuildNumber', 'Census_OSBuildRevision', 'Platform', 'Processor', 'Census_IsPortableOperatingSystem', 'Census_IsPenCapable', 'OsBuild', 'Census_ProcessorManufacturerIdentifier', 'OsVer_1', 'OsVer_2', 'OsVer_3', 'OsVer_4']
(250000, 71) (250000, 71) (250000,) (250000,) HistGradientBoostingClassifier 64.1988
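
The gap between this run (64.1988) and the all-columns run in In [529] (64.1392) is about 0.06 percentage points. As a rough yardstick for how much of that could be sampling noise, the binomial standard error of an accuracy measured on n test rows can be computed as below. This is only a sketch under stated assumptions: it treats the test rows as independent and ignores that both models are scored on the same rows, so a paired comparison would be more precise. `accuracy_standard_error` is a hypothetical helper name.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated from n_samples independent rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


all_columns = 0.641392  # accuracy reported by In [529]
reduced = 0.641988      # accuracy reported by In [563]
n_test = 250_000        # number of test rows in both runs

se = accuracy_standard_error(reduced, n_test)
gap = reduced - all_columns

print(f"standard error: {se * 100:.3f} percentage points")
print(f"observed gap:   {gap * 100:.3f} percentage points")
```

Under these assumptions the standard error comes out at roughly 0.1 percentage points, which is larger than the observed gap, so differences of a few hundredths of a point between runs should be read with caution.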


In [564]:
cols = [
    'AutoSampleOptIn',
    'Census_ProcessorCoreCount',
    'Census_IsFlightsDisabled',
    'IsBeta',
    'ProductName',
    'UacLuaenable',
    'Census_OEMNameIdentifier',
    'IsSxsPassiveMode',
    'OsBuildLab_6',
    'Census_OSArchitecture',
    'GeoNameIdentifier',
    'AVProductsInstalled',
    'Census_SystemVolumeTotalCapacity',
    'Census_DeviceFamily',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_OSEdition',
    'Census_PrimaryDiskTotalCapacity',
    'Census_SystemVolumeTotalCapacity',  # duplicate of the entry above, so this column is evaluated twice
    'Census_TotalPhysicalRAM',
    'LocaleEnglishNameIdentifier',
    'Census_FirmwareManufacturerIdentifier',
    'Census_FirmwareVersionIdentifier',
    'Census_IsAlwaysOnAlwaysConnectedCapable',
    'Census_IsWIMBootEnabled',
    'Census_InternalBatteryNumberOfCharges',
    'Census_OSSkuName',
    'Census_ChassisTypeName',
    'Census_OSBranch',
    'Census_OSBuildNumber',
    'Census_OSBuildRevision',
]

# Drop one candidate column at a time and retrain to see whether accuracy changes.
for c in cols:
    # Skip candidates that were already removed via columns_to_drop.
    if c not in df_full_features.columns:
        continue

    df_features = df_full_features.drop(c, axis=1)

    train_features = df_features.values[:train_count]
    test_features  = df_features.values[train_count:]

    train_labels = full_labels.values[:train_count]
    test_labels = full_labels.values[train_count:]

    scaler = StandardScaler()
    scaler.fit(train_features)
    normalized_train_features = scaler.transform(train_features)
    normalized_test_features = scaler.transform(test_features)

    clf = ske.HistGradientBoostingClassifier(random_state=123)
    clf.fit(normalized_train_features, train_labels)
    score = clf.score(normalized_test_features, test_labels)

    # The two booleans flag whether dropping c matched or beat the In [563] baseline.
    print(c, train_features.shape, "HistGradientBoosting", score*100, score >= all_columns_score, score > all_columns_score)

Census_ProcessorCoreCount (250000, 70) HistGradientBoosting 64.1588 False False
Census_OEMNameIdentifier (250000, 70) HistGradientBoosting 64.1216 False False
IsSxsPassiveMode (250000, 70) HistGradientBoosting 64.2092 True True
OsBuildLab_6 (250000, 70) HistGradientBoosting 64.1352 False False
Census_OSArchitecture (250000, 70) HistGradientBoosting 64.1732 False False
AVProductsInstalled (250000, 70) HistGradientBoosting 64.0532 False False
Census_SystemVolumeTotalCapacity (250000, 70) HistGradientBoosting 64.066 False False
Census_InternalPrimaryDiagonalDisplaySizeInInches (250000, 70) HistGradientBoosting 64.0476 False False
Census_OSEdition (250000, 70) HistGradientBoosting 64.14760000000001 False False
Census_PrimaryDiskTotalCapacity (250000, 70) HistGradientBoosting 64.1688 False False
Census_SystemVolumeTotalCapacity (250000, 70) HistGradientBoosting 64.066 False False
Census_TotalPhysicalRAM (250000, 70) HistGradientBoosting 64.054 False False
LocaleEnglishNameIdentifier (250000
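
The loops in In [564] and In [565] repeat the same drop-one-column procedure. A minimal sketch of how that pattern could be factored into a reusable helper follows. The names `leave_one_out_scores`, `beats_baseline`, and `score_fn` are hypothetical; `score_fn` stands in for whatever scale/fit/score steps are used, and the toy scorer is there only to make the example runnable without the dataset.

```python
def leave_one_out_scores(columns, score_fn, skip=()):
    """Score the model once per column, each time with that column removed.

    columns  : ordered list of feature names
    score_fn : callable taking the list of kept columns and returning an accuracy
               (hypothetical; it could wrap the scaler + classifier steps)
    skip     : column names that should not be tested
    """
    results = {}
    # dict.fromkeys() de-duplicates while preserving order, so a column listed
    # twice is only evaluated once.
    for column in dict.fromkeys(columns):
        if column in skip:
            continue
        kept = [c for c in columns if c != column]
        results[column] = score_fn(kept)
    return results


def beats_baseline(results, baseline):
    """Columns whose removal gives a strictly higher score than the baseline."""
    return [column for column, score in results.items() if score > baseline]


# Toy stand-in for the real fit/score step: pretend column "b" hurts accuracy.
def toy_score(kept_columns):
    return 0.64 + (0.01 if "b" not in kept_columns else 0.0)


results = leave_one_out_scores(["a", "b", "c", "b"], toy_score)
print(beats_baseline(results, baseline=0.64))
```

Keeping the scoring step behind a callable means the same loop can be reused with a different classifier or split without copying the cell.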

In [565]:
# Repeat the drop-one-column test for every remaining column that is not in cols.
for c in df_full_features.columns:
    if c in cols:
        continue

    df_features = df_full_features.drop(c, axis=1)

    train_features = df_features.values[:train_count]
    test_features  = df_features.values[train_count:]

    train_labels = full_labels.values[:train_count]
    test_labels = full_labels.values[train_count:]

    scaler = StandardScaler()
    scaler.fit(train_features)
    normalized_train_features = scaler.transform(train_features)
    normalized_test_features = scaler.transform(test_features)

    clf = ske.HistGradientBoostingClassifier(random_state=123)
    clf.fit(normalized_train_features, train_labels)
    score = clf.score(normalized_test_features, test_labels)

    # The two booleans flag whether dropping c matched or beat the In [563] baseline.
    print(c, train_features.shape, "HistGradientBoosting", score*100, score >= all_columns_score, score > all_columns_score)

RtpStateBitfield (250000, 70) HistGradientBoosting 64.09280000000001 False False
AVProductStatesIdentifier (250000, 70) HistGradientBoosting 63.6484 False False
AVProductsEnabled (250000, 70) HistGradientBoosting 64.1828 False False
HasTpm (250000, 70) HistGradientBoosting 64.1164 False False
CountryIdentifier (250000, 70) HistGradientBoosting 64.0564 False False
OrganizationIdentifier (250000, 70) HistGradientBoosting 64.10679999999999 False False
OsSuite (250000, 70) HistGradientBoosting 64.17399999999999 False False
OsPlatformSubRelease (250000, 70) HistGradientBoosting 64.1584 False False
SkuEdition (250000, 70) HistGradientBoosting 64.1712 False False
SMode (250000, 70) HistGradientBoosting 64.14800000000001 False False
IeVerIdentifier (250000, 70) HistGradientBoosting 64.1132 False False
SmartScreen (250000, 70) HistGradientBoosting 63.1328 False False
Firewall (250000, 70) HistGradientBoosting 64.12440000000001 False False
Census_MDC2FormFactor (250000, 70) HistGradientBoosting 