The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.

The sampling methodology used to create this dataset was designed to meet certain business constraints, with regard to both user privacy and the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above mean you may see imperfect agreement between your cross validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.

In [404]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from sklearn.experimental import enable_hist_gradient_boosting
import sklearn.ensemble as ske
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [405]:
# set up display area to show dataframe in jupyter qtconsole
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

Columns

Unavailable or self-documenting column names are marked with an "NA".

    MachineIdentifier - Individual machine ID
    ProductName - Defender state information e.g. win8defender
    EngineVersion - Defender state information e.g. 1.1.12603.0
    AppVersion - Defender state information e.g. 4.9.10586.0
    AvSigVersion - Defender state information e.g. 1.217.1014.0
    IsBeta - Defender state information e.g. false
    RtpStateBitfield - NA
    IsSxsPassiveMode - NA
    DefaultBrowsersIdentifier - ID for the machine's default browser
    AVProductStatesIdentifier - ID for the specific configuration of a user's antivirus software
    AVProductsInstalled - NA
    AVProductsEnabled - NA
    HasTpm - True if machine has tpm
    CountryIdentifier - ID for the country the machine is located in
    CityIdentifier - ID for the city the machine is located in
    OrganizationIdentifier - ID for the organization the machine belongs in, organization ID is mapped to both specific companies and broad industries
    GeoNameIdentifier - ID for the geographic region a machine is located in
    LocaleEnglishNameIdentifier - English name of Locale ID of the current user
    Platform - Calculates platform name (of OS related properties and processor property)
    Processor - This is the processor architecture of the installed operating system
    OsVer - Version of the current operating system
    OsBuild - Build of the current operating system
    OsSuite - Product suite mask for the current operating system.
    OsPlatformSubRelease - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)
    OsBuildLab - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022
    SkuEdition - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Types are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must handle each class.
    IsProtected - This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected.
    AutoSampleOptIn - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+
    PuaMode - Pua Enabled mode from the service
    SMode - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
    IeVerIdentifier - NA
    SmartScreen - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.
    Firewall - This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service.
    UacLuaenable - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.
    Census_MDC2FormFactor - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)
    Census_DeviceFamily - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
    Census_OEMNameIdentifier - NA
    Census_OEMModelIdentifier - NA
    Census_ProcessorCoreCount - Number of logical cores in the processor
    Census_ProcessorManufacturerIdentifier - NA
    Census_ProcessorModelIdentifier - NA
    Census_ProcessorClass - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated
    Census_PrimaryDiskTotalCapacity - Amount of disk space on primary disk of the machine in MB
    Census_PrimaryDiskTypeName - Friendly name of Primary Disk Type - HDD or SSD
    Census_SystemVolumeTotalCapacity - The size of the partition that the System volume is installed on in MB
    Census_HasOpticalDiskDrive - True indicates that the machine has an optical disk drive (CD/DVD)
    Census_TotalPhysicalRAM - Retrieves the physical RAM in MB
    Census_ChassisTypeName - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
    Census_InternalPrimaryDiagonalDisplaySizeInInches - Retrieves the physical diagonal length in inches of the primary display
    Census_InternalPrimaryDisplayResolutionHorizontal - Retrieves the number of pixels in the horizontal direction of the internal display.
    Census_InternalPrimaryDisplayResolutionVertical - Retrieves the number of pixels in the vertical direction of the internal display
    Census_PowerPlatformRoleName - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
    Census_InternalBatteryType - NA
    Census_InternalBatteryNumberOfCharges - NA
    Census_OSVersion - Numeric OS version Example - 10.0.10130.0
    Census_OSArchitecture - Architecture on which the OS is based. Derived from OSVersionFull. Example - amd64
    Census_OSBranch - Branch of the OS extracted from the OsVersionFull. Example - OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005
    Census_OSBuildNumber - OS Build number extracted from the OsVersionFull. Example - OsBuildNumber = 10512 or 10240
    Census_OSBuildRevision - OS Build revision extracted from the OsVersionFull. Example - OsBuildRevision = 1000 or 16458
    Census_OSEdition - Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise
    Census_OSSkuName - OS edition friendly name (currently Windows only)
    Census_OSInstallTypeName - Friendly description of what install was used on the machine i.e. clean
    Census_OSInstallLanguageIdentifier - NA
    Census_OSUILocaleIdentifier - NA
    Census_OSWUAutoUpdateOptionsName - Friendly name of the WindowsUpdate auto-update settings on the machine.
    Census_IsPortableOperatingSystem - Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.
    Census_GenuineStateName - Friendly name of OSGenuineStateID. 0 = Genuine
    Census_ActivationChannel - Retail license key or Volume license key for a machine.
    Census_IsFlightingInternal - NA
    Census_IsFlightsDisabled - Indicates if the machine is participating in flighting.
    Census_FlightRing - The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.
    Census_ThresholdOptIn - NA
    Census_FirmwareManufacturerIdentifier - NA
    Census_FirmwareVersionIdentifier - NA
    Census_IsSecureBootEnabled - Indicates if Secure Boot mode is enabled.
    Census_IsWIMBootEnabled - NA
    Census_IsVirtualDevice - Identifies a Virtual Machine (machine learning model)
    Census_IsTouchEnabled - Is this a touch device?
    Census_IsPenCapable - Is the device capable of pen input?
    Census_IsAlwaysOnAlwaysConnectedCapable - Retrieves information about whether the battery enables the device to be AlwaysOnAlwaysConnected.
    Wdft_IsGamer - Indicates whether the device is a gamer device or not based on its hardware combination.
    Wdft_RegionIdentifier - NA


In [406]:
# We need to explicitly specify data types when reading the csv, otherwise loading is very memory consuming
# and we get the warning "Specify dtype option on import or set low_memory=False".
# So, we will manually define the data types.

# P.S. I loaded the sample data and exported train_data.dtypes;
# these are the data types for fast loading

datatypes = {
    'ProductName': str,
    'EngineVersion': str,
    'AppVersion': str,
    'AvSigVersion': str,
    'IsBeta': np.int8,
    'RtpStateBitfield': str,
    'IsSxsPassiveMode': np.int8,
    'DefaultBrowsersIdentifier': str,
    'AVProductStatesIdentifier': str,
    'AVProductsInstalled': str,
    'AVProductsEnabled': str,
    'HasTpm': np.int8,
    'CountryIdentifier': str,
    'CityIdentifier': str,
    'OrganizationIdentifier': str,
    'GeoNameIdentifier': str,
    'LocaleEnglishNameIdentifier': str,
    'Platform': str,
    'Processor': str,
    'OsVer': str,
    'OsBuild': str,
    'OsSuite': str,
    'OsPlatformSubRelease': str,
    'OsBuildLab': str,
    'SkuEdition': str,
    'IsProtected': str,
    'AutoSampleOptIn': np.int8,
    'PuaMode': str,
    'SMode': str,
    'IeVerIdentifier': str,
    'SmartScreen': str,
    'Firewall': str,
    'UacLuaenable': str,
    'Census_MDC2FormFactor': str,
    'Census_DeviceFamily': str,
    'Census_OEMNameIdentifier': str,
    'Census_OEMModelIdentifier': str, 
    'Census_ProcessorCoreCount': str,
    'Census_ProcessorManufacturerIdentifier': str,
    'Census_ProcessorModelIdentifier': str,
    'Census_ProcessorClass': str,
    'Census_PrimaryDiskTotalCapacity': np.float64,
    'Census_PrimaryDiskTypeName': str,
    'Census_SystemVolumeTotalCapacity': np.float64,
    'Census_HasOpticalDiskDrive': np.int8,
    'Census_TotalPhysicalRAM': np.float64,
    'Census_ChassisTypeName': str,
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': str,
    'Census_InternalPrimaryDisplayResolutionHorizontal': str,
    'Census_InternalPrimaryDisplayResolutionVertical': str,
    'Census_PowerPlatformRoleName': str,
    'Census_InternalBatteryType': str,
    'Census_InternalBatteryNumberOfCharges': str,
    'Census_OSVersion': str,
    'Census_OSArchitecture': str,
    'Census_OSBranch': str,
    'Census_OSBuildNumber': str,
    'Census_OSBuildRevision': str,
    'Census_OSEdition': str,
    'Census_OSSkuName': str,
    'Census_OSInstallTypeName': str,
    'Census_OSInstallLanguageIdentifier': str,
    'Census_OSUILocaleIdentifier': str,
    'Census_OSWUAutoUpdateOptionsName': str,
    'Census_IsPortableOperatingSystem': np.int8,
    'Census_GenuineStateName': str,
    'Census_ActivationChannel': str,
    'Census_IsFlightingInternal': str,
    'Census_IsFlightsDisabled': str,
    'Census_FlightRing': str,
    'Census_ThresholdOptIn': str,
    'Census_FirmwareManufacturerIdentifier': str,
    'Census_FirmwareVersionIdentifier': str,
    'Census_IsSecureBootEnabled': np.int8,
    'Census_IsWIMBootEnabled': str,
    'Census_IsVirtualDevice': str,
    'Census_IsTouchEnabled': np.int8,
    'Census_IsPenCapable': np.int8,
    'Census_IsAlwaysOnAlwaysConnectedCapable': str,
    'Wdft_IsGamer': str,
    'Wdft_RegionIdentifier': str,
    'HasDetections': np.int8
}

full_features = pd.read_csv("./csv/train.csv", dtype=datatypes, index_col="MachineIdentifier")
#full_features = pd.read_csv("./csv/train.csv", dtype=datatypes, nrows=200000, index_col="MachineIdentifier")
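
The comment above mentions exporting `dtypes` from a sample load. A minimal sketch of that idea follows, assuming a tiny in-memory CSV as a stand-in for `train.csv`; the rows and the names `sample_csv` and `inferred` are hypothetical. Types inferred from a few rows can be wrong for the full file (for example an integer column that later contains NaN), so the result is only a starting point to review by hand.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv (hypothetical rows, real column names)
sample_csv = io.StringIO(
    "MachineIdentifier,ProductName,IsBeta,Census_TotalPhysicalRAM,HasDetections\n"
    "a1,win8defender,0,4096.0,1\n"
    "b2,mse,0,8192.0,0\n"
)

# Read only the first rows and let pandas infer the types
sample = pd.read_csv(sample_csv, nrows=1000, index_col="MachineIdentifier")

# Build a dict usable as read_csv(dtype=...):
# numeric columns keep their inferred dtype name, everything else is read as str
inferred = {
    col: (dtype.name if pd.api.types.is_numeric_dtype(dtype) else str)
    for col, dtype in sample.dtypes.items()
}
print(inferred)
```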

In [407]:
# Shuffle the data
#np.random.seed(0)

shuffle = np.random.permutation(np.arange(full_features.shape[0]))[:500000]
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]
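
With the seed commented out, every run draws a different 500,000 rows. As a sketch of a repeatable alternative, `DataFrame.sample` with a fixed `random_state` shuffles and subsamples in one call; the `demo` frame below is a hypothetical stand-in for `full_features`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for full_features
demo = pd.DataFrame(
    {"HasDetections": np.arange(10) % 2},
    index=[f"machine_{i}" for i in range(10)],
)

# Draw a shuffled subset of 4 rows; random_state makes the draw repeatable
subset_a = demo.sample(n=4, random_state=0)
subset_b = demo.sample(n=4, random_state=0)

# The same seed gives the same rows in the same order
print(subset_a.index.equals(subset_b.index))
```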

In [408]:
full_labels = full_features["HasDetections"]

# Dropping labels ["HasDetections"] from training dataset
full_features = full_features.drop(["HasDetections"], axis=1)

In [409]:
print (full_features.shape)

(500000, 81)


In [410]:
# Checking the columns with the most NULL values
print((full_features.isnull().sum()).sort_values(ascending=False).head(20))

PuaMode                                  499884
Census_ProcessorClass                    497920
DefaultBrowsersIdentifier                475463
Census_IsFlightingInternal               415386
Census_InternalBatteryType               355053
Census_ThresholdOptIn                    317688
Census_IsWIMBootEnabled                  317241
SmartScreen                              178343
OrganizationIdentifier                   153953
SMode                                     30307
CityIdentifier                            18332
Wdft_IsGamer                              17033
Wdft_RegionIdentifier                     17033
Census_InternalBatteryNumberOfCharges     15043
Census_FirmwareManufacturerIdentifier     10103
Census_IsFlightsDisabled                   9018
Census_FirmwareVersionIdentifier           8812
Census_OEMModelIdentifier                  5672
Census_OEMNameIdentifier                   5261
Firewall                                   5177
dtype: int64


In [411]:
full_features['PuaMode'].unique()

array([nan, 'on'], dtype=object)

In [412]:
full_features['Census_IsFlightingInternal'].unique()

array(['0', nan], dtype=object)

In [413]:
full_features['Census_InternalBatteryType'].unique()

array([nan, 'lion', 'li-i', 'liio', '#', 'li p', 'pbac', 'lip', 'nimh',
       'bq20', 'lgs0', 'lipo', 'real', 'li', 'vbox', 'lgi0', 'lipp',
       'unkn', 'pad0', 'ithi', 'lhp0', 'virt', 'a132', '4cel', 'ram',
       'batt', 'ca48', 'ÿÿÿÿ', 'asmb'], dtype=object)

In [414]:
full_features['Census_ThresholdOptIn'].unique()

array(['0', nan, '1'], dtype=object)

In [415]:
full_features['Census_IsWIMBootEnabled'].unique()

array(['0', nan], dtype=object)

In [416]:
full_features['SMode'].unique()

array(['0', nan, '1'], dtype=object)

In [417]:
full_features['OrganizationIdentifier'].unique()

array(['27', '18', nan, '48', '50', '46', '52', '36', '11', '14', '8',
       '37', '49', '4', '2', '6', '33', '26', '40', '32', '5', '20', '7',
       '28', '1', '16', '51', '22', '3', '39', '47', '44', '31', '10',
       '21', '30', '43', '19', '42', '41', '45', '29', '15'], dtype=object)

In [418]:
full_features['Wdft_IsGamer'].unique()

array(['1', '0', nan], dtype=object)

In [419]:
full_features['Wdft_RegionIdentifier'].unique()

array(['11', '10', '3', '2', '1', '8', '13', '9', '4', '15', nan, '12',
       '5', '7', '6', '14'], dtype=object)

In [420]:
full_features['CityIdentifier'].unique()

array([nan, '134062', '66517', ..., '61559', '64692', '147213'],
      dtype=object)

In [421]:
full_features['Census_InternalBatteryNumberOfCharges'].unique()

array(['4294967295', '0', '8', ..., '25100', '14232', '57615'],
      dtype=object)
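
The column-by-column `unique()` checks above could also be condensed into one overview table. This is only a sketch on a hypothetical three-column frame; `demo` and `summary` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame
demo = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SMode": ["0", np.nan, "1", "0"],
    "Processor": ["x64", "x64", "x86", "x64"],
})

# One row per column: how many values are missing and how many distinct values exist
summary = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "n_unique": demo.nunique(dropna=True),
}).sort_values("missing_count", ascending=False)
print(summary)
```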

In [422]:
# Cleaning up some data

# PuaMode - Potentially Unwanted Applications, if NA, then it is disabled. 99% are NA. So, better to drop it
# Census_ProcessorClass - According to the description - "No longer maintained and updated"
# DefaultBrowsersIdentifier - Almost all values are empty. Therefore we will drop this column
# Census_IsFlightingInternal - whether this is internal or "external" testing ring. Column mostly unused. Will have to drop it
# Census_InternalBatteryType - contains mostly garbage. Besides, it should not be relevant to attack surface.
# Census_ThresholdOptIn - also mostly unused. Googled it and Threshold was used in first versions of Windows 10. Looks like unused now
# Census_IsWIMBootEnabled - Is it possible to boot from Windows Image? Not relevant to identification of the attacks when 70% of data is empty
# SmartScreen - Whether smart screen in explorer is enabled. Should be important. "ExistsNotSet" when null, according to the description
# SMode - Quite relevant field. Will be keeping it
# OrganizationIdentifier - Attacks by organizations should be analyzed. If not filled, will assign "0". 
# Census_InternalBatteryNumberOfCharges - Not relevant. Will drop this column in order not to overtrain
# Census_OSSkuName -  OS edition friendly name (currently Windows only). - Can be removed. Duplicate field
# Census_ChassisTypeName - Census_MDC2FormFactor gives better information. Let's remove this field

full_features['PuaMode'] = full_features['PuaMode'].fillna('off')
full_features['SmartScreen'] = full_features['SmartScreen'].fillna('ExistsNotSet')
full_features['SMode'] = full_features['SMode'].fillna('0').astype('int8')
full_features['OrganizationIdentifier'] = full_features['OrganizationIdentifier'].fillna('0').astype('int32')
full_features['Wdft_IsGamer'] = full_features['Wdft_IsGamer'].fillna('0').astype('int8')
full_features['Wdft_RegionIdentifier'] = full_features['Wdft_RegionIdentifier'].fillna('0').astype('int32')
full_features['CityIdentifier'] = full_features['CityIdentifier'].fillna('0').astype('int32')

full_features = full_features.drop([
    'PuaMode',
    'Census_ProcessorClass',
    'DefaultBrowsersIdentifier',
    'Census_IsFlightingInternal',
    'Census_InternalBatteryType'], axis=1)
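
The drop list above was chosen by hand. As a sketch of a rule-based alternative, columns could be dropped when their share of missing values exceeds a cutoff. The cutoff `MAX_MISSING_SHARE` and the `demo` frame are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame
demo = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SMode": ["0", np.nan, "1", "0"],
    "Processor": ["x64", "x64", "x86", "x64"],
})

MAX_MISSING_SHARE = 0.7  # hypothetical cutoff, tune for the real data

# Fraction of missing values per column
missing_share = demo.isnull().mean()

# Columns above the cutoff are removed, the rest are kept for imputation
to_drop = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()
cleaned = demo.drop(columns=to_drop)
print(to_drop)
```

A fixed cutoff ignores domain knowledge (SmartScreen is kept above despite many missing values), so it complements the manual review rather than replacing it.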

In [423]:
# Now let us check the string columns

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['ProductName',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'Platform',
 'Processor',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'IsProtected',
 'IeVerIdentifier',
 'SmartScreen',
 'Firewall',
 'UacLuaenable',
 'Census_MDC2FormFactor',
 'Census_DeviceFamily',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_PowerPlatformRoleName',
 'Census_InternalBatteryNumberOfCharges',
 'Census_OSVersion',
 'Census_OSArchitecture',
 'Cen
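
The loop above can also be written with `DataFrame.select_dtypes`. The sketch below selects every non-numeric column, which matches the string columns only under the assumption that the frame holds nothing but numbers and strings; `demo` is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical miniature of the training frame
demo = pd.DataFrame({
    "ProductName": ["win8defender", "mse"],
    "IsBeta": [0, 0],
    "Census_TotalPhysicalRAM": [4096.0, 8192.0],
})

# Keep the names of all columns that are not numeric
string_columns = demo.select_dtypes(exclude="number").columns.tolist()
print(string_columns)
```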

In [424]:
full_features[string_columns].head(10)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTypeName,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable
78df1db09a83bb9ae275d7281337b30c,win8defender,1.1.14800.3,4.8.10240.17443,1.267.1090.0,7,62773,1,1,95,277,75,windows10,x86,10.0.0.0,10240,256,th1,10240.17443.x86fre.th1.170602-2340,Enterprise,0,53,ExistsNotSet,1,1,Desktop,Windows.Desktop,3469,275118,2,5,4322,HDD,Desktop,16.3,1024,768,Desktop,4294967295,10.0.10240.17443,x86,th1,10240,17443,Enterprise,ENTERPRISE,IBSClean,8,31,UNKNOWN,IS_GENUINE,Volume:GVLK,0,Retail,0.0,809,13303,0.0,0,0
04205ed42bf453f3471b0adf68119954,win8defender,1.1.15200.1,4.18.1807.18075,1.275.288.0,7,53447,1,1,29,35,171,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.1805...,Home,1,117,RequireAdmin,1,1,Desktop,Windows.Desktop,4909,317701,2,1,120,Unspecified,Desktop,17.0,1440,900,Desktop,4294967295,10.0.16299.431,amd64,rs3_release_svc_escrow,16299,431,Core,CORE,Upgrade,26,119,Notify,IS_GENUINE,Retail,0,Retail,0.0,142,53064,0.0,0,0
33aadbfd19773f36dc19087e4d247841,win8defender,1.1.15200.1,4.18.1807.18075,1.275.330.0,7,49480,2,1,171,211,182,windows10,x64,10.0.0.0,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1,135,RequireAdmin,1,1,Notebook,Windows.Desktop,585,208502,2,5,1998,HDD,Notebook,15.5,1366,768,Mobile,0,10.0.16299.125,amd64,rs3_release,16299,125,CoreSingleLanguage,CORE_SINGLELANGUAGE,Update,29,125,UNKNOWN,IS_GENUINE,OEM:DM,0,Retail,,556,63540,,0,0
95a9f8a8a6b1c9e1bf17004e0e253e37,win8defender,1.1.15100.1,4.8.10240.17443,1.273.1362.0,7,53447,1,1,12,15,58,windows10,x64,10.0.0.0,10240,256,th1,10240.17443.amd64fre.th1.170602-2340,Pro,1,53,RequireAdmin,1,1,Notebook,Windows.Desktop,1443,260856,4,5,2514,HDD,Laptop,14.0,1366,768,Mobile,0,10.0.10240.17443,amd64,th1_st1,10240,17443,Professional,PROFESSIONAL,IBSClean,8,31,UNKNOWN,INVALID_LICENSE,Volume:GVLK,0,NOT_SET,0.0,355,19973,0.0,0,0
6a32450ae25c9e76c5b18622f14afe5e,win8defender,1.1.15200.1,4.11.15063.0,1.275.1244.0,7,53447,1,1,201,277,75,windows10,x64,10.0.0.0,15063,256,rs2,15063.0.amd64fre.rs2_release.170317-1834,Pro,1,105,ExistsNotSet,1,1,Desktop,Windows.Desktop,2102,251924,4,5,2684,HDD,Desktop,21.5,1920,1080,Workstation,4294967295,10.0.15063.250,amd64,rs2_release,15063,250,Professional,PROFESSIONAL,Other,7,30,Notify,IS_GENUINE,OEM:DM,0,Retail,,486,48303,,0,0
404c11970409e292a3e55a1beda9fc49,win8defender,1.1.15200.1,4.18.1807.18075,1.275.722.0,7,49480,2,1,59,277,75,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1,137,Off,1,1,Notebook,Windows.Desktop,585,274984,8,5,2746,HDD,Notebook,15.5,1920,1080,Mobile,0,10.0.17134.228,amd64,rs4_release,17134,228,Core,CORE,UUPUpgrade,8,31,FullAuto,IS_GENUINE,OEM:DM,0,Retail,,556,63417,,0,0
34fa5187b95e1377d8e4cc68b23b5668,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1420.0,7,53447,1,1,214,277,75,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Education,1,137,ExistsNotSet,1,1,Notebook,Windows.Desktop,2206,230219,4,5,2640,HDD,Notebook,15.5,1366,768,Mobile,0,10.0.17134.165,amd64,rs4_release,17134,165,Education,EDUCATION,UUPUpgrade,8,31,FullAuto,IS_GENUINE,Retail,0,Retail,,554,33070,,0,0
99a24a5009c8cf47383c472b08cd9624,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1712.0,7,53447,1,1,171,211,182,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1,137,ExistsNotSet,1,1,Desktop,Windows.Desktop,4909,317701,12,5,3117,SSD,Desktop,26.9,1920,1080,Desktop,4294967295,10.0.17134.285,amd64,rs4_release,17134,285,CoreSingleLanguage,CORE_SINGLELANGUAGE,UUPUpgrade,29,125,FullAuto,IS_GENUINE,OEM:NONSLP,0,Retail,,142,52458,,0,0
8f2e2d5416a4bbf33e99456f7b81ad92,win8defender,1.1.14800.3,4.14.17613.18039,1.267.538.0,7,62773,1,1,89,277,75,windows10,x64,10.0.0.0,16299,256,rs3,16299.15.amd64fre.rs3_release.170928-1534,Pro,0,117,RequireAdmin,1,1,Notebook,Windows.Desktop,1443,331893,4,5,3030,SSD,Laptop,13.3,1366,768,Mobile,0,10.0.16299.371,amd64,rs3_release,16299,371,Professional,PROFESSIONAL,UUPUpgrade,8,31,Notify,IS_GENUINE,Retail,0,Retail,,355,7255,,0,0
66f1259d0a4f4720d72e1f966faa48a1,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1695.0,7,53447,1,1,22,19,74,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1,137,ExistsNotSet,0,1,Desktop,Windows.Desktop,1443,275868,8,5,3079,HDD,Desktop,23.8,1920,1080,Desktop,4294967295,10.0.17134.285,amd64,rs4_release,17134,285,Professional,PROFESSIONAL,UUPUpgrade,7,30,FullAuto,IS_GENUINE,OEM:NONSLP,0,Retail,,355,7301,,0,0


At first glance at the data, it becomes obvious that the strings are either classifiers (categories) or versions that contain 4 classifiers in them. So, in order to use algorithms that support only numeric values, we will convert classifiers like "ProductName" to an integer range, and handle version fields like AppVersion separately.
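
One possible way to treat a four-part version string is to split it on the dots into four integer columns. This is a sketch under that assumption, not necessarily the approach taken later in the notebook; the `AppVersion_1` to `AppVersion_4` column names are hypothetical, and real data may contain missing or malformed versions that need handling before the integer cast.

```python
import pandas as pd

# Two example version strings in the format shown in the column descriptions
demo = pd.DataFrame({"AppVersion": ["4.9.10586.0", "4.18.1807.18075"]})

# Split "a.b.c.d" into four columns and cast each part to an integer
parts = demo["AppVersion"].str.split(".", expand=True).astype("int32")
parts.columns = [f"AppVersion_{i + 1}" for i in range(parts.shape[1])]

# Attach the numeric parts next to the original string column
demo = demo.join(parts)
print(demo)
```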

In [425]:
def df_replacevalues(df, colname, oldvalues, newvalues):
    # First, we need to get the most frequent value of the column
    topvalue = df[colname].value_counts().idxmax()

    # Replace NaN values with the most frequent value
    # (assign the result back; inplace=True on a selected column is not reliable)
    df[colname] = df[colname].fillna(topvalue)

    # Find the rows holding values that are not listed in oldvalues
    garbage = ~df[colname].isin(oldvalues)

    # If the "Garbage" values are more than 1%, then raise an error
    if garbage.sum() > len(df) / 100:
        raise ValueError("Not all necessary values are present in oldvalues array")

    # Replace "Garbage" with the top value
    df.loc[garbage, colname] = topvalue

    print ("Previous values", df[colname].unique())
    df[colname] = pd.to_numeric(df[colname].replace(oldvalues, newvalues), errors='raise', downcast='integer')
    print ("New values", df[colname].unique())
    
#full_features["Platform"].unique()
#full_features["Platform"].value_counts()
#full_features[~full_features["ProductName"].isin(['win8defender', 'mse'])].index

In [426]:
colname = "ProductName"
oldvalues = ['win8defender','mse','mseprerelease','windowsintune','fep','scep']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['win8defender' 'mse' 'scep' 'mseprerelease' 'fep']
New values [1 2 6 3 5]


In [427]:
colname = "Platform"
oldvalues = ['windows10','windows7','windows8','windows2016']
newvalues = [10,7,8,2016]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['windows10' 'windows8' 'windows7' 'windows2016']
New values [  10    8    7 2016]


In [428]:
colname = "Processor"
oldvalues = ['x64','arm64','x86']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['x86' 'x64' 'arm64']
New values [3 1 2]


In [429]:
colname = "OsPlatformSubRelease"
oldvalues = ['rs4','rs1','rs3','windows7','windows8.1','th1','rs2','th2','prers5']
newvalues = [504,501,503,507,508,201,502,202,405]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['th1' 'rs3' 'rs2' 'rs4' 'th2' 'windows8.1' 'windows7' 'rs1' 'prers5']
New values [201 503 502 504 202 508 507 501 405]


In [430]:
colname = "SkuEdition"
oldvalues = ['Pro','Home','Invalid','Enterprise LTSB','Enterprise','Education','Cloud','Server']
newvalues = [55,52,0,71,70,20,90,80]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Enterprise' 'Home' 'Pro' 'Education' 'Invalid' 'Enterprise LTSB'
 'Server' 'Cloud']
New values [70 52 55 20  0 71 80 90]


In [431]:
colname = "SmartScreen"
oldvalues = ['Off','off','OFF','On','on','Warn','Prompt','ExistsNotSet','Block','RequireAdmin']
newvalues = [0,0,0,1,1,2,3,4,5,6]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['ExistsNotSet' 'RequireAdmin' 'Off' 'Warn' 'Prompt' 'Block' 'off' 'On'
 'on']
New values [4 6 0 2 3 5 1]
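
The SmartScreen list above spells out each capitalization ('Off', 'off', 'OFF') by hand. A sketch of an alternative is to lower-case the column first and map through a dictionary; the integer codes below copy the ones used above, while the `codes` dictionary and the example series are illustrative.

```python
import pandas as pd

# Example values with mixed capitalization, as seen in the SmartScreen column
smartscreen = pd.Series(["Off", "off", "OFF", "On", "on", "Warn", "RequireAdmin"])

# Normalize case once so each category needs a single dictionary entry
normalized = smartscreen.str.lower()
codes = {"off": 0, "on": 1, "warn": 2, "prompt": 3,
         "existsnotset": 4, "block": 5, "requireadmin": 6}

# Values missing from the dictionary would become NaN and need separate handling
encoded = normalized.map(codes)
print(encoded.tolist())
```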


In [432]:
colname = "Census_MDC2FormFactor"
oldvalues = ['Desktop','Notebook','Detachable','PCOther','AllInOne','Convertible','SmallTablet','LargeTablet','SmallServer','LargeServer','MediumServer','ServerOther','IoTOther']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Desktop' 'Notebook' 'AllInOne' 'PCOther' 'Detachable' 'Convertible'
 'LargeTablet' 'LargeServer' 'SmallServer' 'SmallTablet' 'MediumServer'
 'ServerOther']
New values [ 1  2  5  4  3  6  8 10  9  7 11 12]


In [433]:
# Census_DeviceFamily ['Windows.Desktop' 'Windows.Server' 'Windows']

colname = "Census_DeviceFamily"
oldvalues = ['Windows.Desktop','Windows.Server','Windows']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Windows.Desktop' 'Windows.Server' 'Windows']
New values [1 2 3]


In [434]:
# Census_PrimaryDiskTypeName ['HDD' 'SSD' 'UNKNOWN' 'Unspecified' nan]

colname = "Census_PrimaryDiskTypeName"
oldvalues = ['HDD','SSD','UNKNOWN','Unspecified']
newvalues = [1,2,3,3]  # 'UNKNOWN' and 'Unspecified' share one code

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['HDD' 'Unspecified' 'SSD' 'UNKNOWN']
New values [1 3 2]


In [435]:
# Census_ChassisTypeName Index(['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 'RackMountChassis', 'SubNotebook', 'BusExpansionChassis', '30', 'StickPC', '0', 'MultisystemChassis', 'Blade', '35', 'PizzaBox', 'SealedCasePC', 'SubChassis', 'ExpansionChassis', '31', '32', '88', '127', '25', '44', '36', 'DockingStation', 'BladeEnclosure', 'CompactPCI', '81', '45', 'EmbeddedPC', '28', '82', '112', 'IoTGateway', '49', '76', '39'], dtype='object')

colname = "Census_ChassisTypeName"
oldvalues = ['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 
             'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 
             'RackMountChassis', 'SubNotebook', 'BusExpansionChassis']
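# Laptop, Portable and SubNotebook share Notebook's code (1), MiniPC shares Desktop's code (2),
# and both spellings of "unknown" map to -1
newvalues = [1,2,1,1,3,4,5,6,-1,7,8,9,10,11,12,-1,13,2,14,15,1,16]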

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Desktop' 'Notebook' 'Laptop' 'AllinOne' 'Portable' 'UNKNOWN' 'Unknown'
 'LunchBox' 'Convertible' 'MiniTower' 'LowProfileDesktop' 'Other' 'Tower'
 'SpaceSaving' 'Detachable' 'HandHeld' 'Tablet' 'MainServerChassis'
 'RackMountChassis' 'MiniPC' 'SubNotebook' 'BusExpansionChassis']
New values [ 2  1  3 -1 14  5  4  8  6 12 10  7  9 11 13 15 16]
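
The value list in the comment above contains entries such as `'30'` or `'StickPC'` that are not in `oldvalues`. How `df_replacevalues` treats them is not shown in this section. If they were to pass through unchanged, a later `pd.to_numeric` call would read `'30'` as the number 30 and collide with the hand-assigned codes. The sketch below shows one way to guard against that on toy data; sending everything unlisted to the catch-all code `-1` is an assumption of this example, not something the notebook does.

```python
import pandas as pd

# Toy chassis values, including two that are outside the table (hypothetical data).
toy = pd.Series(["Notebook", "Desktop", "30", "StickPC", "UNKNOWN", None])

# A subset of the codes from the cell above.
chassis_codes = {"Notebook": 1, "Desktop": 2, "Laptop": 1, "UNKNOWN": -1}

# Assumption: anything outside the table, including missing values,
# goes to the same catch-all code as 'UNKNOWN'.
OTHER = -1
encoded = toy.map(chassis_codes).fillna(OTHER).astype("int16")
print(encoded.tolist())

# For comparison: a numeric-looking string survives to_numeric as a number,
# while a non-numeric one becomes NaN.
leaked = pd.to_numeric(pd.Series(["30", "StickPC"]), errors="coerce")
print(leaked.tolist())
```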


In [436]:
# Census_PowerPlatformRoleName Index(['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified'])

colname = "Census_PowerPlatformRoleName"
full_features[colname] = full_features[colname].fillna('UNKNOWN')
oldvalues = ['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']
# Workstation shares Desktop's code (2); UNKNOWN, Unspecified and missing values all map to 0
newvalues = [1,2,3,2,4,0,5,6,7,0]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Desktop' 'Mobile' 'Workstation' 'Slate' 'AppliancePC' 'SOHOServer'
 'UNKNOWN' 'EnterpriseServer' 'PerformanceServer' 'Unspecified']
New values [2 1 3 6 4 0 5 7]


In [437]:
# Census_OSArchitecture Index(['amd64', 'x86', 'arm64'], dtype='object')

colname = "Census_OSArchitecture"
oldvalues = ['amd64', 'x86', 'arm64']
newvalues = [1,3,2]  # same codes as Processor: x64/amd64 -> 1, arm64 -> 2, x86 -> 3

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['x86' 'amd64' 'arm64']
New values [3 1 2]


In [438]:
# Census_OSBranch Index(['rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 'rs2_release', 'rs1_release', 'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'rs5_release', 'rs3_release_svc_escrow_im', 'rs_prerelease', 'rs_prerelease_flt', 'rs5_release_sigma', 'rs1_release_srvmedia', 'winblue_ltsb_escrow', 'win7sp1_ldr', 'winblue_ltsb', 'win8_gdr', 'rs_xbox', 'rs5_release_edge', 'rs5_release_sigma_dev', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'rs_shell', 'rs1_release_svc', 'win8_ldr', 'rs_onecore_base_cobalt', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs3_release_svc', 'Khmer OS'], dtype='object')

colname = "Census_OSBranch"
oldvalues = ['rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 'rs2_release', 'rs1_release', 'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'rs5_release', 'rs3_release_svc_escrow_im', 'rs_prerelease', 'rs_prerelease_flt', 'rs5_release_sigma']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['th1' 'rs3_release_svc_escrow' 'rs3_release' 'th1_st1' 'rs2_release'
 'rs4_release' 'th2_release' 'th2_release_sec' 'rs1_release' 'rs5_release'
 'rs3_release_svc_escrow_im' 'rs_prerelease' 'rs_prerelease_flt'
 'rs5_release_sigma']
New values [ 9  3  2  8  4  1  6  7  5 10 11 12 13 14]


In [439]:
# Census_OSSkuName Index(['CORE', 'PROFESSIONAL', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'EDUCATION', 'ENTERPRISE', 'PROFESSIONAL_N', 'ENTERPRISE_S', 'STANDARD_SERVER', 'CLOUD', 'CORE_N', 'STANDARD_EVALUATION_SERVER', 'EDUCATION_N', 'ENTERPRISE_S_N', 'DATACENTER_EVALUATION_SERVER', 'SB_SOLUTION_SERVER', 'ENTERPRISE_N', 'PRO_WORKSTATION', 'UNLICENSED', 'DATACENTER_SERVER', 'PRO_WORKSTATION_N', 'CLOUDN', 'PRO_CHINA', 'SERVERRDSH', 'ULTIMATE', 'PRO_FOR_EDUCATION', 'PRO_SINGLE_LANGUAGE', 'UNDEFINED', 'STARTER', 'ENTERPRISEG'], dtype='object')

colname = "Census_OSSkuName"
oldvalues = ['CORE', 'PROFESSIONAL', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'EDUCATION', 'ENTERPRISE', 'PROFESSIONAL_N', 'ENTERPRISE_S', 'STANDARD_SERVER', 'CLOUD', 'CORE_N', 'STANDARD_EVALUATION_SERVER', 'EDUCATION_N', 'ENTERPRISE_S_N', 'DATACENTER_EVALUATION_SERVER', 'SB_SOLUTION_SERVER', 'ENTERPRISE_N', 'PRO_WORKSTATION', 'UNLICENSED']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['ENTERPRISE' 'CORE' 'CORE_SINGLELANGUAGE' 'PROFESSIONAL' 'EDUCATION'
 'CORE_COUNTRYSPECIFIC' 'ENTERPRISE_S' 'PROFESSIONAL_N'
 'STANDARD_EVALUATION_SERVER' 'STANDARD_SERVER' 'CLOUD' 'ENTERPRISE_S_N'
 'CORE_N' 'EDUCATION_N' 'ENTERPRISE_N' 'SB_SOLUTION_SERVER'
 'DATACENTER_EVALUATION_SERVER' 'PRO_WORKSTATION' 'UNLICENSED']
New values [ 6  1  3  2  5  4  8  7 12  9 10 14 11 13 17 16 15 18 19]


In [440]:
# Census_OSInstallTypeName Index(['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh'], dtype='object')

colname = "Census_OSInstallTypeName"
oldvalues = ['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['IBSClean' 'Upgrade' 'Update' 'Other' 'UUPUpgrade' 'Refresh'
 'CleanPCRefresh' 'Reset' 'Clean']
New values [2 4 3 5 1 7 9 6 8]


In [441]:
# Census_OSWUAutoUpdateOptionsName Index(['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify'], dtype='object')

colname = "Census_OSWUAutoUpdateOptionsName"
oldvalues = ['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['UNKNOWN' 'Notify' 'FullAuto' 'AutoInstallAndRebootAtMaintenanceTime'
 'Off' 'DownloadNotify']
New values [2 3 1 4 5 6]


In [442]:
# Census_GenuineStateName Index(['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED'], dtype='object')

colname = "Census_GenuineStateName"
oldvalues = ['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['IS_GENUINE' 'INVALID_LICENSE' 'OFFLINE' 'UNKNOWN']
New values [1 2 3 4]


In [443]:
# Census_ActivationChannel Index(['Retail', 'OEM:DM', 'Volume:GVLK', 'OEM:NONSLP', 'Volume:MAK', 'Retail:TB:Eval'], dtype='object')

colname = "Census_ActivationChannel"
oldvalues = ['Retail', 'OEM:DM', 'Volume:GVLK', 'OEM:NONSLP', 'Volume:MAK', 'Retail:TB:Eval']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Volume:GVLK' 'Retail' 'OEM:DM' 'OEM:NONSLP' 'Retail:TB:Eval'
 'Volume:MAK']
New values [3 1 2 4 6 5]


In [444]:
# Census_FlightRing Index(['Retail', 'NOT_SET', 'Unknown', 'WIS', 'WIF', 'RP', 'Disabled', 'OSG', 'Canary', 'Invalid', 'CBCanary'], dtype='object')

colname = "Census_FlightRing"
oldvalues = ['Retail', 'NOT_SET', 'Unknown', 'WIS', 'WIF', 'RP', 'Disabled', 'OSG', 'Canary', 'Invalid', 'CBCanary']
# Unknown, Disabled, OSG, Canary, Invalid and CBCanary are all grouped under 0
newvalues = [1,2,0,3,4,5,0,0,0,0,0]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Retail' 'NOT_SET' 'Unknown' 'WIF' 'Disabled' 'WIS' 'RP']
New values [1 2 0 4 3 5]


In [445]:
# PuaMode Index(['off', 'on', 'audit'], dtype='object')

#colname = "PuaMode"
#oldvalues = ['off', 'on', 'audit']
#newvalues = [0,1,2]

#df_replacevalues(full_features, colname, oldvalues, newvalues)

In [446]:
# Census_OSEdition

colname = "Census_OSEdition"
oldvalues = ['Core','Professional','CoreSingleLanguage','CoreCountrySpecific','ProfessionalEducation','Education',
             'Enterprise','ProfessionalN','EnterpriseS','ServerStandard','Cloud','CoreN']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['Enterprise' 'Core' 'CoreSingleLanguage' 'Professional' 'Education'
 'CoreCountrySpecific' 'ProfessionalEducation' 'EnterpriseS'
 'ProfessionalN' 'ServerStandard' 'Cloud' 'CoreN']
New values [ 7  1  3  2  6  4  5  9  8 10 11 12]


In [447]:
# Now let us check the string columns again

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsBuildLab',
 'IsProtected',
 'IeVerIdentifier',
 'Firewall',
 'UacLuaenable',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_InternalBatteryNumberOfCharges',
 'Census_OSVersion',
 'Census_OSBuildNumber',
 'Census_OSBuildRevision',
 'Census_OSInstallLanguageIdentifier',
 'Census_OSUILocaleIdentifier',
 'Census_IsFlightsDisabled',
 'Census_ThresholdOptIn',
 'Census_FirmwareManufacturerIdentifier',
 'Census_FirmwareVersionIdentifier',
 'Census_IsWIM
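
Several of the columns listed above (for example `OsBuild`) appear to hold numbers stored as text. The sketch below, on a toy frame, shows an equivalent way to list object columns with `select_dtypes` and then separate those that parse cleanly as numbers from those that need splitting or remapping. The frame `toy` and the name `numeric_like` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame: two text columns and one integer-coded column (hypothetical data).
toy = pd.DataFrame({
    "OsBuild": pd.Series(["10240", "16299"], dtype="object"),
    "Processor": pd.Series([3, 1], dtype="int8"),
    "AppVersion": pd.Series(["4.8.10240.17443", "4.18.1807.18075"], dtype="object"),
})

# Same result as looping over dtypes and keeping the "object" columns.
string_columns = toy.select_dtypes(include="object").columns.tolist()

# Columns where every value parses as a number can be converted directly.
numeric_like = [
    c for c in string_columns
    if pd.to_numeric(toy[c], errors="coerce").notna().all()
]
print(string_columns)
print(numeric_like)
```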

In [448]:
full_features[string_columns].head(10)

MachineIdentifier,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,OsVer,OsBuild,OsSuite,OsBuildLab,IsProtected,IeVerIdentifier,Firewall,UacLuaenable,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_IsFlightsDisabled,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable
78df1db09a83bb9ae275d7281337b30c,1.1.14800.3,4.8.10240.17443,1.267.1090.0,7,62773,1,1,95,277,75,10.0.0.0,10240,256,10240.17443.x86fre.th1.170602-2340,0,53,1,1,3469,275118,2,5,4322,16.3,1024,768,4294967295,10.0.10240.17443,10240,17443,8,31,0,0.0,809,13303,0.0,0,0
04205ed42bf453f3471b0adf68119954,1.1.15200.1,4.18.1807.18075,1.275.288.0,7,53447,1,1,29,35,171,10.0.0.0,16299,768,16299.431.amd64fre.rs3_release_svc_escrow.1805...,1,117,1,1,4909,317701,2,1,120,17.0,1440,900,4294967295,10.0.16299.431,16299,431,26,119,0,0.0,142,53064,0.0,0,0
33aadbfd19773f36dc19087e4d247841,1.1.15200.1,4.18.1807.18075,1.275.330.0,7,49480,2,1,171,211,182,10.0.0.0,16299,768,16299.15.amd64fre.rs3_release.170928-1534,1,135,1,1,585,208502,2,5,1998,15.5,1366,768,0,10.0.16299.125,16299,125,29,125,0,,556,63540,,0,0
95a9f8a8a6b1c9e1bf17004e0e253e37,1.1.15100.1,4.8.10240.17443,1.273.1362.0,7,53447,1,1,12,15,58,10.0.0.0,10240,256,10240.17443.amd64fre.th1.170602-2340,1,53,1,1,1443,260856,4,5,2514,14.0,1366,768,0,10.0.10240.17443,10240,17443,8,31,0,0.0,355,19973,0.0,0,0
6a32450ae25c9e76c5b18622f14afe5e,1.1.15200.1,4.11.15063.0,1.275.1244.0,7,53447,1,1,201,277,75,10.0.0.0,15063,256,15063.0.amd64fre.rs2_release.170317-1834,1,105,1,1,2102,251924,4,5,2684,21.5,1920,1080,4294967295,10.0.15063.250,15063,250,7,30,0,,486,48303,,0,0
404c11970409e292a3e55a1beda9fc49,1.1.15200.1,4.18.1807.18075,1.275.722.0,7,49480,2,1,59,277,75,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1,137,1,1,585,274984,8,5,2746,15.5,1920,1080,0,10.0.17134.228,17134,228,8,31,0,,556,63417,,0,0
34fa5187b95e1377d8e4cc68b23b5668,1.1.15100.1,4.18.1807.18075,1.273.1420.0,7,53447,1,1,214,277,75,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137,1,1,2206,230219,4,5,2640,15.5,1366,768,0,10.0.17134.165,17134,165,8,31,0,,554,33070,,0,0
99a24a5009c8cf47383c472b08cd9624,1.1.15200.1,4.18.1807.18075,1.275.1712.0,7,53447,1,1,171,211,182,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1,137,1,1,4909,317701,12,5,3117,26.9,1920,1080,4294967295,10.0.17134.285,17134,285,29,125,0,,142,52458,,0,0
8f2e2d5416a4bbf33e99456f7b81ad92,1.1.14800.3,4.14.17613.18039,1.267.538.0,7,62773,1,1,89,277,75,10.0.0.0,16299,256,16299.15.amd64fre.rs3_release.170928-1534,0,117,1,1,1443,331893,4,5,3030,13.3,1366,768,0,10.0.16299.371,16299,371,8,31,0,,355,7255,,0,0
66f1259d0a4f4720d72e1f966faa48a1,1.1.15200.1,4.18.1807.18075,1.275.1695.0,7,53447,1,1,22,19,74,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1,137,0,1,1443,275868,8,5,3079,23.8,1920,1080,4294967295,10.0.17134.285,17134,285,7,30,0,,355,7301,,0,0


In [449]:
# Now we need to process the columns that contain version numbers.
# Each one is split on '.' and '-' into 4 to 6 separate columns.

versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']
newcolumnnames = []

for colname in versions:
    data = full_features[colname].str.split(r"\.|-",expand=True) # Split if '.' or '-'
    for i in range(data.shape[1]):
        newcolumnname = "%s_%d" % (colname, i+1)
        newcolumnnames.append(newcolumnname)
        full_features[newcolumnname] = data[i]
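
As a small worked example of the split step, the sketch below applies the same idea to two `OsBuildLab` strings taken from the table above. It names all the new columns at once and attaches them with `pd.concat`; the frame `toy` is an assumption made for illustration, and the character class `[.-]` is an equivalent spelling of the pattern used in the cell.

```python
import pandas as pd

# Two OsBuildLab strings as they appear in the sample rows above.
toy = pd.DataFrame({
    "OsBuildLab": pd.Series([
        "10240.17443.x86fre.th1.170602-2340",
        "17134.1.amd64fre.rs4_release.180410-1804",
    ], dtype="object")
})

# Split on '.' or '-'; underscores are kept, so 'rs4_release' stays whole.
parts = toy["OsBuildLab"].str.split(r"[.-]", expand=True)

# Name the pieces OsBuildLab_1 ... OsBuildLab_n in one step.
parts.columns = [f"OsBuildLab_{i + 1}" for i in range(parts.shape[1])]

toy = pd.concat([toy, parts], axis=1)
print(toy.iloc[0].tolist())
```

With `expand=True`, rows that yield fewer pieces than the longest row are padded with missing values, so the number of new columns is set by the longest string in the column.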

In [450]:
full_features[newcolumnnames].head(10)

MachineIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
78df1db09a83bb9ae275d7281337b30c,1,1,14800,3,4,8,10240,17443,1,267,1090,0,10,0,0,0,10240,17443,x86fre,th1,170602,2340,10,0,10240,17443
04205ed42bf453f3471b0adf68119954,1,1,15200,1,4,18,1807,18075,1,275,288,0,10,0,0,0,16299,431,amd64fre,rs3_release_svc_escrow,180502,1908,10,0,16299,431
33aadbfd19773f36dc19087e4d247841,1,1,15200,1,4,18,1807,18075,1,275,330,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,125
95a9f8a8a6b1c9e1bf17004e0e253e37,1,1,15100,1,4,8,10240,17443,1,273,1362,0,10,0,0,0,10240,17443,amd64fre,th1,170602,2340,10,0,10240,17443
6a32450ae25c9e76c5b18622f14afe5e,1,1,15200,1,4,11,15063,0,1,275,1244,0,10,0,0,0,15063,0,amd64fre,rs2_release,170317,1834,10,0,15063,250
404c11970409e292a3e55a1beda9fc49,1,1,15200,1,4,18,1807,18075,1,275,722,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
34fa5187b95e1377d8e4cc68b23b5668,1,1,15100,1,4,18,1807,18075,1,273,1420,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,165
99a24a5009c8cf47383c472b08cd9624,1,1,15200,1,4,18,1807,18075,1,275,1712,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,285
8f2e2d5416a4bbf33e99456f7b81ad92,1,1,14800,3,4,14,17613,18039,1,267,538,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,371
66f1259d0a4f4720d72e1f966faa48a1,1,1,15200,1,4,18,1807,18075,1,275,1695,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,285


In [451]:
#colname = "OsBuildLab_4"
#print (full_features[colname].value_counts())
#print (colname, full_features[colname].value_counts().keys())

In [452]:
# After splitting, only OsBuildLab_3 and OsBuildLab_4 need to be remapped.
# The other parts hold digits only (still stored as strings at this point).

# OsBuildLab_3 Index(['amd64fre', 'x86fre', 'arm64fre'], dtype='object')

colname = "OsBuildLab_3"
oldvalues = ['amd64fre', 'x86fre', 'arm64fre']
newvalues = [1,3,2]  # same architecture codes as Processor and Census_OSArchitecture

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['x86fre' 'amd64fre' 'arm64fre']
New values [3 1 2]


In [453]:
# OsBuildLab_4 Index(['rs4_release', 'rs3_release_svc_escrow', 'rs3_release', 'rs2_release', 'rs1_release', 'th2_release_sec', 'th1', 'winblue_ltsb_escrow', 'th2_release', 'rs1_release_inmarket', 'winblue_ltsb', 'win7sp1_ldr', 'rs3_release_svc', 'rs1_release_1', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'th1_st1', 'rs5_release', 'rs1_release_inmarket_aim', 'rs3_release_svc_escrow_im', 'th2_release_inmarket', 'rs_prerelease', 'rs_prerelease_flt', 'win7sp1_gdr', 'winblue_gdr', 'th1_escrow', 'win7_gdr', 'winblue_r4', 'rs1_release_inmarket_rim', 'rs1_release_d', 'winblue_r9', 'winblue_r5', 'win7_rtm', 'win7sp1_rtm', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'rs5_release_sigma', 'win7_ldr', 'rs5_release_sigma_dev', 'rs_xbox', 'rs5_release_edge', 'winblue_rtm', 'win7sp1_rc', 'rs3_release_svc_sec', 'rs_onecore_base_cobalt', 'rs6_prerelease', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs_shell']

colname = "OsBuildLab_4"
oldvalues = ['rs4_release', 'rs3_release_svc_escrow', 'rs3_release', 'rs2_release', 'rs1_release', 'th2_release_sec', 'th1', 'winblue_ltsb_escrow', 'th2_release', 'rs1_release_inmarket', 'winblue_ltsb', 'win7sp1_ldr', 'rs3_release_svc', 'rs1_release_1', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'th1_st1', 'rs5_release', 'rs1_release_inmarket_aim', 'rs3_release_svc_escrow_im', 'th2_release_inmarket', 'rs_prerelease', 'rs_prerelease_flt', 'win7sp1_gdr', 'winblue_gdr', 'th1_escrow', 'win7_gdr', 'winblue_r4', 'rs1_release_inmarket_rim', 'rs1_release_d', 'winblue_r9', 'winblue_r5', 'win7_rtm', 'win7sp1_rtm', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'rs5_release_sigma', 'win7_ldr', 'rs5_release_sigma_dev', 'rs_xbox', 'rs5_release_edge', 'winblue_rtm', 'win7sp1_rc', 'rs3_release_svc_sec', 'rs_onecore_base_cobalt', 'rs6_prerelease', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs_shell']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues)

Previous values ['th1' 'rs3_release_svc_escrow' 'rs3_release' 'rs2_release' 'rs4_release'
 'th2_release' 'winblue_ltsb_escrow' 'th2_release_sec' 'rs3_release_svc'
 'win7sp1_ldr_escrow' 'rs1_release' 'rs1_release_sec'
 'rs1_release_inmarket' 'rs5_release' 'win7sp1_ldr' 'th1_st1'
 'rs1_release_1' 'winblue_ltsb' 'rs1_release_inmarket_aim'
 'th2_release_inmarket' 'rs3_release_svc_escrow_im' 'rs_prerelease'
 'winblue_r9' 'win7sp1_gdr' 'rs_prerelease_flt' 'win7_gdr' 'win7sp1_rtm'
 'winblue_r4' 'winblue_gdr' 'th1_escrow' 'rs1_release_d' 'winblue_r7'
 'winblue_r5' 'rs1_release_inmarket_rim' 'rs5_release_sigma'
 'rs5_release_edge' 'win7_rtm' 'winblue_r8' 'winblue_r3']
New values [ 7  2  3  4  1  9  8  6 13 15  5 16 10 18 12 17 14 11 19 21 20 22 31 24
 23 27 34 28 25 26 30 35 32 29 38 42 33 37 36]
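
The long `oldvalues` lists in these cells look as if they were pasted from `value_counts()` output, so the assigned code is effectively a frequency rank. Under that assumption, the sketch below builds the same kind of table automatically on toy data; the names `toy` and `codes` are hypothetical.

```python
import pandas as pd

# Toy branch names with distinct frequencies (hypothetical data).
toy = pd.Series(
    ["rs4_release"] * 3 + ["rs3_release"] * 2 + ["th1"],
    dtype="object",
)

# value_counts() sorts by descending frequency, so its index plays
# the role of the hand-written oldvalues list.
oldvalues = toy.value_counts().index.tolist()
newvalues = list(range(1, len(oldvalues) + 1))

codes = dict(zip(oldvalues, newvalues))
encoded = toy.map(codes)
print(codes)
print(encoded.tolist())
```

Note that codes derived this way depend on the sample, so the table would need to be built once and reused if the same encoding is to be applied to another file.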


In [454]:
versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']

full_features = full_features.drop(versions, axis=1)

In [455]:
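# Convert every remaining column to numeric. Strings that cannot be parsed
# become NaN and are then filled with the column's most frequent value.
for colname in full_features.columns:
    if full_features[colname].dtypes.name not in ["int8","int16","int32"]:
        full_features[colname] = pd.to_numeric(full_features[colname], errors='coerce')
        topvalue = full_features[colname].value_counts().idxmax()
        # Assign the result back; fillna(..., inplace=True) on a selected column may not update the frame
        full_features[colname] = full_features[colname].fillna(topvalue)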
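
To make the effect of this conversion concrete, the sketch below runs the same two steps (coerce to numeric, then fill with the most frequent value) on a toy frame. The frame and its values are assumptions made for illustration. The extra guard for columns that contain no parseable value is also an addition of this sketch: `value_counts()` would be empty there and `idxmax()` would raise.

```python
import pandas as pd

# Toy frame: one column with a missing value, one with a leftover string.
toy = pd.DataFrame({
    "Census_ThresholdOptIn": pd.Series(["0.0", None, "0.0", "1.0"], dtype="object"),
    "OsBuildLab_4": pd.Series([1, 1, "rs_shell", 2], dtype="object"),
})

for colname in toy.columns:
    # Unparseable entries such as 'rs_shell' become NaN.
    converted = pd.to_numeric(toy[colname], errors="coerce")
    # Guard (added in this sketch): skip the fill if nothing could be parsed.
    if converted.notna().any():
        topvalue = converted.value_counts().idxmax()
        converted = converted.fillna(topvalue)
    toy[colname] = converted

print(toy)
```

In this toy case the unmapped string `'rs_shell'` ends up with the most common code, which shows how categories missing from a mapping are silently merged into the majority class.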

In [456]:
full_features.head(10)

MachineIdentifier,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
78df1db09a83bb9ae275d7281337b30c,1,0,7.0,0,62773.0,1.0,1.0,1,95,0,27,277.0,75,10,3,10240,256,201,70,0.0,0,0,53.0,4,1.0,1.0,1,1,3469.0,275118.0,2.0,5.0,4322.0,305245.0,1,51000.0,1,1024.0,2,16.3,1024.0,768.0,2,4294967000.0,3,9,10240,17443,7,6,2,8.0,31,2,0,1,3,0.0,1,0.0,809.0,13303.0,0,0.0,0.0,0,0,0.0,1,11,1,1,14800,3,4,8,10240,17443,1,267,1090,0,10,0,0,0,10240,17443,3,7,170602,2340,10,0,10240,17443
04205ed42bf453f3471b0adf68119954,1,0,7.0,0,53447.0,1.0,1.0,1,29,134062,18,35.0,171,10,1,16299,768,503,52,1.0,0,0,117.0,6,1.0,1.0,1,1,4909.0,317701.0,2.0,1.0,120.0,381554.0,3,199550.0,0,4096.0,2,17.0,1440.0,900.0,2,4294967000.0,1,3,16299,431,1,1,4,26.0,119,3,0,1,1,0.0,1,0.0,142.0,53064.0,0,0.0,0.0,0,0,0.0,0,10,1,1,15200,1,4,18,1807,18075,1,275,288,0,10,0,0,0,16299,431,1,2,180502,1908,10,0,16299,431
33aadbfd19773f36dc19087e4d247841,1,0,7.0,0,49480.0,2.0,1.0,1,171,66517,27,211.0,182,10,1,16299,768,503,52,1.0,0,0,135.0,6,1.0,1.0,2,1,585.0,208502.0,2.0,5.0,1998.0,476940.0,1,475799.0,0,4096.0,1,15.5,1366.0,768.0,1,0.0,1,2,16299,125,3,3,3,29.0,125,2,0,1,2,0.0,1,0.0,556.0,63540.0,1,0.0,0.0,0,0,0.0,0,3,1,1,15200,1,4,18,1807,18075,1,275,330,0,10,0,0,0,16299,15,1,3,170928,1534,10,0,16299,125
95a9f8a8a6b1c9e1bf17004e0e253e37,1,0,7.0,0,53447.0,1.0,1.0,1,12,109853,27,15.0,58,10,1,10240,256,201,55,1.0,0,0,53.0,6,1.0,1.0,2,1,1443.0,260856.0,4.0,5.0,2514.0,476940.0,1,476438.0,0,4096.0,1,14.0,1366.0,768.0,1,0.0,1,8,10240,17443,2,2,2,8.0,31,2,0,2,3,0.0,2,0.0,355.0,19973.0,0,0.0,0.0,0,0,0.0,0,2,1,1,15100,1,4,8,10240,17443,1,273,1362,0,10,0,0,0,10240,17443,1,7,170602,2340,10,0,10240,17443
6a32450ae25c9e76c5b18622f14afe5e,1,0,7.0,0,53447.0,1.0,1.0,1,201,4785,18,277.0,75,10,1,15063,256,502,55,1.0,0,0,105.0,4,1.0,1.0,1,1,2102.0,251924.0,4.0,5.0,2684.0,953869.0,1,936087.0,0,8192.0,2,21.5,1920.0,1080.0,2,4294967000.0,1,4,15063,250,2,2,5,7.0,30,3,0,1,2,0.0,1,0.0,486.0,48303.0,1,0.0,0.0,0,0,0.0,0,11,1,1,15200,1,4,11,15063,0,1,275,1244,0,10,0,0,0,15063,0,1,4,170317,1834,10,0,15063,250
404c11970409e292a3e55a1beda9fc49,1,0,7.0,0,49480.0,2.0,1.0,1,59,22656,27,277.0,75,10,1,17134,768,504,52,1.0,0,0,137.0,0,1.0,1.0,2,1,585.0,274984.0,8.0,5.0,2746.0,953869.0,1,952727.0,1,8192.0,1,15.5,1920.0,1080.0,1,0.0,1,1,17134,228,1,1,1,8.0,31,1,0,1,2,0.0,1,0.0,556.0,63417.0,1,0.0,0.0,0,0,0.0,1,11,1,1,15200,1,4,18,1807,18075,1,275,722,0,10,0,0,0,17134,1,1,1,180410,1804,10,0,17134,228
34fa5187b95e1377d8e4cc68b23b5668,1,0,7.0,0,53447.0,1.0,1.0,1,214,13249,0,277.0,75,10,1,17134,256,504,20,1.0,0,0,137.0,4,1.0,1.0,2,1,2206.0,230219.0,4.0,5.0,2640.0,476940.0,1,101741.0,0,4096.0,1,15.5,1366.0,768.0,1,0.0,1,1,17134,165,6,5,1,8.0,31,1,0,1,1,0.0,1,0.0,554.0,33070.0,0,0.0,0.0,0,0,0.0,0,1,1,1,15100,1,4,18,1807,18075,1,273,1420,0,10,0,0,0,17134,1,1,1,180410,1804,10,0,17134,165
99a24a5009c8cf47383c472b08cd9624,1,0,7.0,0,53447.0,1.0,1.0,1,171,65859,18,211.0,182,10,1,17134,768,504,52,1.0,0,0,137.0,4,1.0,1.0,1,1,4909.0,317701.0,12.0,5.0,3117.0,238475.0,2,237381.0,0,16384.0,2,26.9,1920.0,1080.0,2,4294967000.0,1,1,17134,285,3,3,1,29.0,125,1,0,1,4,0.0,1,0.0,142.0,52458.0,0,0.0,0.0,0,0,0.0,1,3,1,1,15200,1,4,18,1807,18075,1,275,1712,0,10,0,0,0,17134,1,1,1,180410,1804,10,0,17134,285
8f2e2d5416a4bbf33e99456f7b81ad92,1,0,7.0,0,62773.0,1.0,1.0,1,89,144275,27,277.0,75,10,1,16299,256,503,55,0.0,0,0,117.0,6,1.0,1.0,2,1,1443.0,331893.0,4.0,5.0,3030.0,244198.0,2,178646.0,0,8192.0,1,13.3,1366.0,768.0,1,0.0,1,2,16299,371,2,2,1,8.0,31,3,0,1,1,0.0,1,0.0,355.0,7255.0,0,0.0,0.0,1,0,0.0,0,1,1,1,14800,3,4,14,17613,18039,1,267,538,0,10,0,0,0,16299,15,1,3,170928,1534,10,0,16299,371
66f1259d0a4f4720d72e1f966faa48a1,1,0,7.0,0,53447.0,1.0,1.0,1,22,2073,27,19.0,74,10,1,17134,256,504,55,1.0,0,0,137.0,4,0.0,1.0,1,1,1443.0,275868.0,8.0,5.0,3079.0,953869.0,1,441134.0,0,16384.0,2,23.8,1920.0,1080.0,2,4294967000.0,1,1,17134,285,2,2,1,7.0,30,1,0,1,4,0.0,1,0.0,355.0,7301.0,1,0.0,0.0,0,0,0.0,0,11,1,1,15200,1,4,18,1807,18075,1,275,1695,0,10,0,0,0,17134,1,1,1,180410,1804,10,0,17134,285


In [457]:
# Let's see some details of the loaded data
full_features.describe()

Unnamed: 0,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
count,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0
mean,1.010948,1.4e-05,6.842316,0.01764,47829.367432,1.326734,1.021244,0.987684,107.85262,78369.934028,17.206384,169.478772,122.82748,13.162024,1.182666,15718.366288,575.37648,480.064772,52.62182,0.94629,3e-05,0.000444,126.67964,4.850516,0.979002,0.995296,2.199532,1.001624,2221.162386,239976.35473,3.990136,4.536358,2374.611462,513012.9,1.419094,375513.0,0.076904,6102.633904,1.544468,16.675433,1546.642086,896.889144,1.368858,1087452000.0,1.182586,2.644406,15833.03259,974.948748,1.978268,1.952288,2.945614,14.566796,60.46987,1.883602,0.000534,1.146568,1.597176,8e-06,1.01477,8.2e-05,397.87541,33067.115612,0.486328,0.0,0.00691,0.12538,0.03797,0.056784,0.274286,7.615108,1.0,1.0,15073.946844,1.29628,4.0,15.857022,5659.000976,14078.094548,0.999988,272.3398,935.437368,0.0,9.870152,0.075914,0.010808,7.2e-05,15718.360898,1426.215526,1.182666,3.077894,176515.257138,1777.13815,9.999992,2e-06,15833.03259,974.947432
std,0.105262,0.003742,1.034612,0.131639,14038.36691,0.523686,0.168347,0.110292,62.976884,50364.735859,12.386641,89.311365,69.26977,80.529449,0.576142,2193.454442,248.029388,80.27171,5.893891,0.225445,0.005477,0.021067,42.640397,1.271068,0.143378,0.251718,1.316243,0.040316,1308.39535,71815.182731,2.064681,1.280881,836.799424,359487.3,0.621253,324566.7,0.266439,5167.785271,1.479233,5.903868,367.610869,214.512125,0.626165,1867625000.0,0.576029,2.026149,1964.140294,2935.740269,1.148348,1.089955,1.818161,10.171607,44.971147,0.93676,0.023102,0.431993,0.757574,0.002828,0.304355,0.009055,222.273155,21033.519545,0.499814,0.0,0.082839,0.33115,0.191124,0.23143,0.446154,4.696168,0.0,0.0,278.497325,1.023396,0.0,3.32704,6262.911942,7388.505248,0.003464,6.113678,532.108138,0.0,0.708895,0.447781,0.105637,0.036,2193.454274,4620.543187,0.576142,3.128643,6045.119166,211.715879,0.005657,0.001414,1964.140294,2935.740473
min,1.0,0.0,0.0,0.0,39.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,7.0,1.0,7600.0,16.0,201.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,1.0,1.0,74.0,22.0,1.0,1.0,3.0,10240.0,1.0,0.0,0.0,512.0,-1.0,4.9,-1.0,-1.0,0.0,0.0,1.0,1.0,7601.0,0.0,1.0,1.0,1.0,1.0,5.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,9.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,11701.0,0.0,4.0,4.0,203.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,7600.0,0.0,1.0,1.0,90713.0,100.0,6.0,0.0,7601.0,0.0
25%,1.0,0.0,7.0,0.0,49480.0,1.0,1.0,1.0,51.0,31368.0,0.0,89.0,75.0,10.0,1.0,15063.0,256.0,503.0,52.0,1.0,0.0,0.0,111.0,4.0,1.0,1.0,2.0,1.0,1443.0,189819.0,2.0,5.0,1998.0,239372.0,1.0,120561.8,0.0,4096.0,1.0,13.9,1366.0,768.0,1.0,0.0,1.0,1.0,15063.0,167.0,1.0,1.0,1.0,8.0,31.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,142.0,13299.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,1.0,15100.0,1.0,4.0,13.0,1807.0,17443.0,1.0,273.0,499.0,0.0,10.0,0.0,0.0,0.0,15063.0,1.0,1.0,1.0,170928.0,1804.0,10.0,0.0,15063.0,167.0
50%,1.0,0.0,7.0,0.0,53447.0,1.0,1.0,1.0,97.0,77866.0,18.0,181.0,88.0,10.0,1.0,16299.0,768.0,503.0,52.0,1.0,0.0,0.0,135.0,4.0,1.0,1.0,2.0,1.0,2102.0,248045.0,4.0,5.0,2503.0,476940.0,1.0,249039.0,0.0,4096.0,1.0,15.5,1366.0,768.0,1.0,0.0,1.0,2.0,16299.0,285.0,2.0,2.0,3.0,9.0,34.0,2.0,0.0,1.0,1.0,0.0,1.0,0.0,486.0,33076.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,15100.0,1.0,4.0,18.0,1807.0,18075.0,1.0,273.0,948.0,0.0,10.0,0.0,0.0,0.0,16299.0,1.0,1.0,2.0,180410.0,1804.0,10.0,0.0,16299.0,285.0
75%,1.0,0.0,7.0,0.0,53447.0,2.0,1.0,1.0,160.0,121274.0,27.0,267.0,182.0,10.0,1.0,17134.0,768.0,504.0,55.0,1.0,0.0,0.0,137.0,6.0,1.0,1.0,2.0,1.0,2668.0,308207.5,4.0,5.0,2874.0,953869.0,2.0,475968.0,0.0,8192.0,2.0,17.2,1920.0,1080.0,2.0,4294967000.0,1.0,4.0,17134.0,547.0,3.0,3.0,4.0,20.0,90.0,3.0,0.0,1.0,2.0,0.0,1.0,0.0,556.0,52302.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,1.0,1.0,15200.0,1.0,4.0,18.0,10586.0,18075.0,1.0,275.0,1379.0,0.0,10.0,0.0,0.0,0.0,17134.0,431.0,1.0,4.0,180410.0,1834.0,10.0,0.0,17134.0,547.0
max,6.0,1.0,35.0,1.0,70507.0,6.0,5.0,1.0,222.0,167962.0,52.0,296.0,283.0,2016.0,3.0,18242.0,784.0,508.0,90.0,1.0,1.0,1.0,429.0,6.0,1.0,48.0,12.0,3.0,6144.0,345494.0,88.0,10.0,4472.0,34338800.0,3.0,7630601.0,1.0,671744.0,16.0,142.0,11520.0,6480.0,7.0,4294967000.0,3.0,14.0,18242.0,24241.0,12.0,19.0,9.0,39.0,162.0,6.0,1.0,4.0,6.0,1.0,5.0,1.0,1087.0,72097.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,15.0,1.0,1.0,15300.0,6.0,4.0,18.0,17686.0,20082.0,1.0,277.0,4320.0,0.0,10.0,3.0,7.0,18.0,18242.0,24236.0,3.0,42.0,180914.0,2340.0,10.0,1.0,18242.0,24241.0


In [458]:
full_features['UacLuaenable'].unique()

array([ 1.,  0., 48.,  2.,  3.])
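`unique()` lists the distinct values but not how often each one occurs, so it cannot show whether a value such as 48 is a rare outlier or a common category. A minimal sketch of the follow-up check is given here. It uses a small made-up Series in place of `full_features['UacLuaenable']`, and treating the column as a 0/1 flag is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy stand-in for full_features['UacLuaenable']; the values are made up for illustration.
uac = pd.Series([1.0, 1.0, 1.0, 0.0, 48.0, 1.0, 2.0, 1.0, 3.0, 1.0], name="UacLuaenable")

# value_counts() returns the frequency of each distinct value, most frequent first.
counts = uac.value_counts()
print(counts)

# Share of rows holding anything other than 0 or 1 (assumed here to be the expected flag values).
unexpected_share = (~uac.isin([0.0, 1.0])).mean()
print("share outside {0, 1}:", unexpected_share)
```

On the real column, the same two calls would show how many rows carry each of the five values returned by `unique()`.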

In [459]:
#[] (180000, 97) (20000, 97) (180000,) (20000,) AdaBoostClassifier 61.675000000000004
#['AvSigVersion_1', 'AvSigVersion_2', 'AvSigVersion_3', 'AvSigVersion_4'] (180000, 93) (20000, 93) (180000,) (20000,) AdaBoostClassifier 61.46
#['Census_InternalPrimaryDiagonalDisplaySizeInInches'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.695
#['Census_OSEdition'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.4
#['Census_PrimaryDiskTotalCapacity'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.58
#['Census_SystemVolumeTotalCapacity'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.77
#['Census_TotalPhysicalRAM'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.150000000000006
#['IsBeta'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.0
#['AutoSampleOptIn'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.9
#['LocaleEnglishNameIdentifier'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.61
#['Census_IsFlightsDisabled'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.53999999999999
#['Census_FirmwareManufacturerIdentifier'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.08
#['Census_FirmwareVersionIdentifier'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.370000000000005
#['Census_IsVirtualDevice'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.605
#['Census_IsAlwaysOnAlwaysConnectedCapable'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.745000000000005
#['Census_ThresholdOptIn'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.32
#['Census_IsWIMBootEnabled'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.775000000000006
#['Census_InternalBatteryNumberOfCharges'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.35
#['Census_OSSkuName'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.535
#['Census_ChassisTypeName'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.56
#['Census_OSBranch'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.5
#['Census_OSBuildNumber'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.845000000000006
#['Census_OSBuildRevision'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.529999999999994
#['Census_OSArchitecture'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.515
#['OsBuild'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 62.28
#['ProductName'] (180000, 96) (20000, 96) (180000,) (20000,) AdaBoostClassifier 61.660000000000004

In [460]:
# ProductName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.843
# IsBeta (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.82299999999999
# RtpStateBitfield (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.1
# IsSxsPassiveMode (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.007999999999996
# AVProductStatesIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.812
# AVProductsInstalled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.739999999999995
# AVProductsEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.726000000000006
# HasTpm (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.788
# CountryIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.824999999999996
# CityIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.033
# OrganizationIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.937
# GeoNameIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.17
# LocaleEnglishNameIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.695
# Platform (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.032
# Processor (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.975
# OsBuild (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.855000000000004
# OsSuite (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.788
# OsPlatformSubRelease (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.868
# SkuEdition (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.836999999999996
# IsProtected (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.912
# AutoSampleOptIn (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.805
# PuaMode (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.992000000000004
# SMode (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.044
# IeVerIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.91
# SmartScreen (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.769
# Firewall (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.061
# UacLuaenable (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.867000000000004
# Census_MDC2FormFactor (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.064
# Census_DeviceFamily (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.235
# Census_OEMNameIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.188
# Census_OEMModelIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.931999999999995
# Census_ProcessorCoreCount (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.60099999999999
# Census_ProcessorManufacturerIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.711000000000006
# Census_ProcessorModelIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.089000000000006
# Census_PrimaryDiskTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.877
# Census_PrimaryDiskTypeName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.064
# Census_SystemVolumeTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.244
# Census_HasOpticalDiskDrive (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.875
# Census_TotalPhysicalRAM (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.589000000000006
# Census_ChassisTypeName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.017999999999994
# Census_InternalPrimaryDiagonalDisplaySizeInInches (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.071
# Census_InternalPrimaryDisplayResolutionHorizontal (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.791
# Census_InternalPrimaryDisplayResolutionVertical (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.002
# Census_PowerPlatformRoleName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.772000000000006
# Census_InternalBatteryNumberOfCharges (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.058
# Census_OSArchitecture (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.68
# Census_OSBranch (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.763
# Census_OSBuildNumber (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.09400000000001
# Census_OSBuildRevision (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.088
# Census_OSEdition (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.94
# Census_OSSkuName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.002
# Census_OSInstallTypeName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.624
# Census_OSInstallLanguageIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.745000000000005
# Census_OSUILocaleIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.869
# Census_OSWUAutoUpdateOptionsName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.265
# Census_IsPortableOperatingSystem (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.085
# Census_GenuineStateName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.782000000000004
# Census_ActivationChannel (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.992000000000004
# Census_IsFlightsDisabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.19
# Census_FlightRing (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.839
# Census_ThresholdOptIn (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.91799999999999
# Census_FirmwareManufacturerIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.086
# Census_FirmwareVersionIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.980000000000004
# Census_IsSecureBootEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.899
# Census_IsWIMBootEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.181
# Census_IsVirtualDevice (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.017
# Census_IsTouchEnabled (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.849
# Census_IsPenCapable (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.948
# Census_IsAlwaysOnAlwaysConnectedCapable (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.012
# Wdft_IsGamer (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.692
# Wdft_RegionIdentifier (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.754
# EngineVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.983999999999995
# EngineVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.927
# EngineVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.714999999999996
# EngineVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.184
# AppVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.894000000000005
# AppVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.035
# AppVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.913
# AppVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.929
# AvSigVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.687000000000005
# AvSigVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.013
# AvSigVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.944
# AvSigVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.809000000000005
# OsVer_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.743
# OsVer_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.92100000000001
# OsVer_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.056999999999995
# OsVer_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.97899999999999
# OsBuildLab_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.166999999999994
# OsBuildLab_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.893
# OsBuildLab_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.076
# OsBuildLab_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.063
# OsBuildLab_5 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.016000000000005
# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288
# Census_OSVersion_1 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.991
# Census_OSVersion_2 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.041000000000004
# Census_OSVersion_3 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.946999999999996
# Census_OSVersion_4 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.86000000000001

In [461]:
# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288
# Census_OSWUAutoUpdateOptionsName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.265
# Census_SystemVolumeTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.244
# Census_DeviceFamily (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.235
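The four lines in this cell are the highest-scoring entries of the preceding log. Rather than picking them out by eye, the log can be parsed and sorted. The sketch below does this for a handful of lines copied from the log; `parse_log` is a hypothetical helper name, and it assumes only that the first token after `#` is the column name and the last token is the accuracy.

```python
# A few lines copied from the commented log; the full log would be parsed the same way.
log = """\
# ProductName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.843
# Census_DeviceFamily (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.235
# Census_SystemVolumeTotalCapacity (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.244
# Census_OSWUAutoUpdateOptionsName (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.265
# OsBuildLab_6 (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 62.288
# Census_TotalPhysicalRAM (400000, 96) (100000, 96) (400000,) (100000,) AdaBoostClassifier 61.589000000000006
"""


def parse_log(text):
    """Return (column, score) pairs: first token after '#' is the column, last token is the score."""
    rows = []
    for line in text.splitlines():
        tokens = line.lstrip("# ").split()
        if len(tokens) >= 2:
            rows.append((tokens[0], float(tokens[-1])))
    return rows


# Sort by score, highest first, and keep the top four entries.
ranked = sorted(parse_log(log), key=lambda row: row[1], reverse=True)
for column, score in ranked[:4]:
    print(column, score)
```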

In [529]:
train_count = 250000  # fixed split point; alternative: int(len(full_features) * 0.8)

# Positional split: the first train_count rows train the model, the rest are held out.
train_features = full_features.values[:train_count]
test_features = full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]

# Fit the scaler on the training rows only, then apply the same transform to the held-out rows.
scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

# Baseline with every column; score() returns the mean accuracy on the held-out rows.
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
all_columns_score = clf.score(normalized_test_features, test_labels)

print("All columns (normalized)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)


All columns (normalized) (250000, 96) (250000, 96) (250000,) (250000,) HistGradientBoostingClassifier 64.1392


In [530]:
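# Project the standardized features onto the first 80 principal components.
# The projection is fitted on the training rows only and then applied to the held-out rows.
model = PCA(n_components=80)
pca_train_results = model.fit_transform(normalized_train_features)
pca_test_results = model.transform(normalized_test_features)

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(pca_train_results, train_labels)
pca_all_columns_score = clf.score(pca_test_results, test_labels)

# Note: the shapes printed here are the pre-PCA shapes (96 columns); the classifier itself sees 80 components.
print("All columns (PCA)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", pca_all_columns_score*100)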


All columns (PCA) (250000, 96) (250000, 96) (250000,) (250000,) HistGradientBoostingClassifier 61.8156
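Reducing to 80 components lowered the held-out accuracy from 64.1392 to 61.8156. One common way to choose `n_components` is to look at the cumulative explained variance instead of fixing a number up front. The sketch below shows that calculation on synthetic data; the data, the 95% target, and the variable names are assumptions for illustration only, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 20 columns driven by 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then accumulate the per-component variance shares.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the assumed 95% target.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print("components for 95% variance:", n_keep)
```

Retaining most of the variance does not guarantee equal accuracy: PCA rotates the feature axes, and a tree-based model that splits on one feature at a time may find the rotated features harder to use.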


In [563]:
columns_to_drop = [
    #'Census_OSArchitecture',
    'GeoNameIdentifier',
    'UacLuaenable',
    'Census_FirmwareVersionIdentifier',
    'IsProtected',
    #'OsSuite',
    'CityIdentifier',
    'Census_OEMModelIdentifier',
    
    # SAME
    'Census_ThresholdOptIn',
    'AutoSampleOptIn',
    'Census_IsFlightsDisabled',
    'IsBeta',
    'ProductName',
    'Census_IsWIMBootEnabled',
    'Census_DeviceFamily',
    'Census_OSBuildNumber',
    'Census_OSBuildRevision',
    'Platform',
    'Processor',
    'Census_IsPortableOperatingSystem',
    'Census_IsPenCapable',
    'OsBuild',
    'Census_ProcessorManufacturerIdentifier',
    'OsVer_1',
    'OsVer_2',
    'OsVer_3',
    'OsVer_4'
]

# Remove the selected columns, then repeat the same split / scale / fit / score steps as the baseline.
df_full_features = full_features.drop(columns_to_drop, axis=1)

train_features = df_full_features.values[:train_count]
test_features = df_full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]

scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
# all_columns_score is overwritten here and serves as the reference for the drop-one-column loops.
all_columns_score = clf.score(normalized_test_features, test_labels)

print(columns_to_drop)
print(train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)

['GeoNameIdentifier', 'UacLuaenable', 'Census_FirmwareVersionIdentifier', 'IsProtected', 'CityIdentifier', 'Census_OEMModelIdentifier', 'Census_ThresholdOptIn', 'AutoSampleOptIn', 'Census_IsFlightsDisabled', 'IsBeta', 'ProductName', 'Census_IsWIMBootEnabled', 'Census_DeviceFamily', 'Census_OSBuildNumber', 'Census_OSBuildRevision', 'Platform', 'Processor', 'Census_IsPortableOperatingSystem', 'Census_IsPenCapable', 'OsBuild', 'Census_ProcessorManufacturerIdentifier', 'OsVer_1', 'OsVer_2', 'OsVer_3', 'OsVer_4']
(250000, 71) (250000, 71) (250000,) (250000,) HistGradientBoostingClassifier 64.1988
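The split, scale, fit, and score steps are repeated in several cells. A small helper could wrap them so that each experiment becomes a single call. The sketch below is one possible shape for such a helper: `holdout_accuracy` and its parameters are hypothetical names, and the tiny synthetic frame only demonstrates the call, it is not the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

try:
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:  # older scikit-learn releases require the experimental flag first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier


def holdout_accuracy(features, labels, train_count, drop=()):
    """Hypothetical helper: drop columns, split by position, scale, fit, return accuracy."""
    kept = features.drop(columns=list(drop))
    X_train, X_test = kept.values[:train_count], kept.values[train_count:]
    y_train, y_test = labels.values[:train_count], labels.values[train_count:]

    scaler = StandardScaler().fit(X_train)  # statistics come from the training rows only
    clf = HistGradientBoostingClassifier(random_state=123)
    clf.fit(scaler.transform(X_train), y_train)
    return clf.score(scaler.transform(X_test), y_test)


# Tiny synthetic example (not the competition data): the label depends on column "a" only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(400, 3)), columns=["a", "b", "c"])
toy_labels = pd.Series((toy["a"] > 0).astype(int))

baseline = holdout_accuracy(toy, toy_labels, train_count=300)
without_a = holdout_accuracy(toy, toy_labels, train_count=300, drop=["a"])
print(baseline, without_a)
```

With a helper like this, a drop-one-column experiment reduces to calling it once per column with `drop=[c]` and comparing the result against the baseline.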


In [564]:
cols = [
    'AutoSampleOptIn',
    'Census_ProcessorCoreCount',
    'Census_IsFlightsDisabled',
    'IsBeta',
    'ProductName',
    'UacLuaenable',
    'Census_OEMNameIdentifier',
    'IsSxsPassiveMode',
    'OsBuildLab_6',
    'Census_OSArchitecture',
    'GeoNameIdentifier',
    'AVProductsInstalled',
    'Census_SystemVolumeTotalCapacity',
    'Census_DeviceFamily',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_OSEdition',
    'Census_PrimaryDiskTotalCapacity',
    'Census_SystemVolumeTotalCapacity',  # duplicate of the entry above, so this column is scored twice
    'Census_TotalPhysicalRAM',
    'LocaleEnglishNameIdentifier',
    'Census_FirmwareManufacturerIdentifier',
    'Census_FirmwareVersionIdentifier',
    'Census_IsAlwaysOnAlwaysConnectedCapable',
    'Census_IsWIMBootEnabled',
    'Census_InternalBatteryNumberOfCharges',
    'Census_OSSkuName',
    'Census_ChassisTypeName',
    'Census_OSBranch',
    'Census_OSBuildNumber',
    'Census_OSBuildRevision',
]

# Drop each candidate column in turn and compare the held-out accuracy with all_columns_score.
for c in cols:
    # Skip candidates that were already removed via columns_to_drop.
    if c not in df_full_features.columns:
        continue

    df_features = df_full_features.drop(c, axis=1)

    train_features = df_features.values[:train_count]
    test_features = df_features.values[train_count:]

    train_labels = full_labels.values[:train_count]
    test_labels = full_labels.values[train_count:]

    scaler = StandardScaler()
    scaler.fit(train_features)
    normalized_train_features = scaler.transform(train_features)
    normalized_test_features = scaler.transform(test_features)

    clf = ske.HistGradientBoostingClassifier(random_state=123)
    clf.fit(normalized_train_features, train_labels)
    score = clf.score(normalized_test_features, test_labels)

    # The last two flags show whether dropping c matched or beat the reference score.
    print(c, train_features.shape, "HistGradientBoosting", score*100, score >= all_columns_score, score > all_columns_score)

Census_ProcessorCoreCount (250000, 70) HistGradientBoosting 64.1588 False False
Census_OEMNameIdentifier (250000, 70) HistGradientBoosting 64.1216 False False
IsSxsPassiveMode (250000, 70) HistGradientBoosting 64.2092 True True
OsBuildLab_6 (250000, 70) HistGradientBoosting 64.1352 False False
Census_OSArchitecture (250000, 70) HistGradientBoosting 64.1732 False False
AVProductsInstalled (250000, 70) HistGradientBoosting 64.0532 False False
Census_SystemVolumeTotalCapacity (250000, 70) HistGradientBoosting 64.066 False False
Census_InternalPrimaryDiagonalDisplaySizeInInches (250000, 70) HistGradientBoosting 64.0476 False False
Census_OSEdition (250000, 70) HistGradientBoosting 64.14760000000001 False False
Census_PrimaryDiskTotalCapacity (250000, 70) HistGradientBoosting 64.1688 False False
Census_SystemVolumeTotalCapacity (250000, 70) HistGradientBoosting 64.066 False False
Census_TotalPhysicalRAM (250000, 70) HistGradientBoosting 64.054 False False
LocaleEnglishNameIdentifier (250000
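The differences between these scores are very small: dropping `IsSxsPassiveMode` moves the accuracy from 64.1988 to 64.2092. A rough yardstick for how much of that could be sampling noise is the binomial standard error of an accuracy measured on 250,000 held-out rows. The sketch below computes it. It treats each test row as an independent trial, which is a simplifying assumption, and because both scores come from the same test rows it is only an approximate guide, not a formal significance test.

```python
import math

n_test = 250_000          # held-out rows, from the printed shapes
baseline = 0.641988       # reference accuracy after removing columns_to_drop
best_drop = 0.642092      # accuracy with IsSxsPassiveMode additionally removed

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
std_err = math.sqrt(baseline * (1 - baseline) / n_test)

print(f"standard error: {std_err * 100:.4f} percentage points")
print(f"observed gain:  {(best_drop - baseline) * 100:.4f} percentage points")
```

The observed gain is well below one standard error, so on this evidence alone it is hard to say that dropping the column helps rather than simply not hurting.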

In [565]:
# Same drop-one-column test for every remaining column that was not already covered by cols.
for c in df_full_features.columns:
    if c in cols:
        continue

    df_features = df_full_features.drop(c, axis=1)

    train_features = df_features.values[:train_count]
    test_features = df_features.values[train_count:]

    train_labels = full_labels.values[:train_count]
    test_labels = full_labels.values[train_count:]

    scaler = StandardScaler()
    scaler.fit(train_features)
    normalized_train_features = scaler.transform(train_features)
    normalized_test_features = scaler.transform(test_features)

    clf = ske.HistGradientBoostingClassifier(random_state=123)
    clf.fit(normalized_train_features, train_labels)
    score = clf.score(normalized_test_features, test_labels)

    # The last two flags show whether dropping c matched or beat the reference score.
    print(c, train_features.shape, "HistGradientBoosting", score*100, score >= all_columns_score, score > all_columns_score)

RtpStateBitfield (250000, 70) HistGradientBoosting 64.09280000000001 False False
AVProductStatesIdentifier (250000, 70) HistGradientBoosting 63.6484 False False
AVProductsEnabled (250000, 70) HistGradientBoosting 64.1828 False False
HasTpm (250000, 70) HistGradientBoosting 64.1164 False False
CountryIdentifier (250000, 70) HistGradientBoosting 64.0564 False False
OrganizationIdentifier (250000, 70) HistGradientBoosting 64.10679999999999 False False
OsSuite (250000, 70) HistGradientBoosting 64.17399999999999 False False
OsPlatformSubRelease (250000, 70) HistGradientBoosting 64.1584 False False
SkuEdition (250000, 70) HistGradientBoosting 64.1712 False False
SMode (250000, 70) HistGradientBoosting 64.14800000000001 False False
IeVerIdentifier (250000, 70) HistGradientBoosting 64.1132 False False
SmartScreen (250000, 70) HistGradientBoosting 63.1328 False False
Firewall (250000, 70) HistGradientBoosting 64.12440000000001 False False
Census_MDC2FormFactor (250000, 70) HistGradientBoosting 