The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.

The sampling methodology used to create this dataset was designed to meet certain business constraints, both in regards to user privacy as well as the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is made complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above may mean you may see imperfect agreement between your cross validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from sklearn.experimental import enable_hist_gradient_boosting
import sklearn.ensemble as ske
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [2]:
# set up display area to show dataframe in jupyter qtconsole
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [3]:
# We need to specify data types explicitly when reading the csv; otherwise loading is very memory-consuming
# and we get the warning "Specify dtype option on import or set low_memory=False".
# So, we define the data types manually.

# P.S. I loaded a sample of the data and exported train_data.dtypes;
# these are the data types for fast loading

datatypes = {
    'ProductName': str,
    'EngineVersion': str,
    'AppVersion': str,
    'AvSigVersion': str,
    'IsBeta': np.int8,
    'RtpStateBitfield': str,
    'IsSxsPassiveMode': np.int8,
    'DefaultBrowsersIdentifier': str,
    'AVProductStatesIdentifier': str,
    'AVProductsInstalled': str,
    'AVProductsEnabled': str,
    'HasTpm': np.int8,
    'CountryIdentifier': str,
    'CityIdentifier': str,
    'OrganizationIdentifier': str,
    'GeoNameIdentifier': str,
    'LocaleEnglishNameIdentifier': str,
    'Platform': str,
    'Processor': str,
    'OsVer': str,
    'OsBuild': str,
    'OsSuite': str,
    'OsPlatformSubRelease': str,
    'OsBuildLab': str,
    'SkuEdition': str,
    'IsProtected': str,
    'AutoSampleOptIn': np.int8,
    'PuaMode': str,
    'SMode': str,
    'IeVerIdentifier': str,
    'SmartScreen': str,
    'Firewall': str,
    'UacLuaenable': str,
    'Census_MDC2FormFactor': str,
    'Census_DeviceFamily': str,
    'Census_OEMNameIdentifier': str,
    'Census_OEMModelIdentifier': str, 
    'Census_ProcessorCoreCount': str,
    'Census_ProcessorManufacturerIdentifier': str,
    'Census_ProcessorModelIdentifier': str,
    'Census_ProcessorClass': str,
    'Census_PrimaryDiskTotalCapacity': np.float64,
    'Census_PrimaryDiskTypeName': str,
    'Census_SystemVolumeTotalCapacity': np.float64,
    'Census_HasOpticalDiskDrive': np.int8,
    'Census_TotalPhysicalRAM': np.float64,
    'Census_ChassisTypeName': str,
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': str,
    'Census_InternalPrimaryDisplayResolutionHorizontal': str,
    'Census_InternalPrimaryDisplayResolutionVertical': str,
    'Census_PowerPlatformRoleName': str,
    'Census_InternalBatteryType': str,
    'Census_InternalBatteryNumberOfCharges': str,
    'Census_OSVersion': str,
    'Census_OSArchitecture': str,
    'Census_OSBranch': str,
    'Census_OSBuildNumber': str,
    'Census_OSBuildRevision': str,
    'Census_OSEdition': str,
    'Census_OSSkuName': str,
    'Census_OSInstallTypeName': str,
    'Census_OSInstallLanguageIdentifier': str,
    'Census_OSUILocaleIdentifier': str,
    'Census_OSWUAutoUpdateOptionsName': str,
    'Census_IsPortableOperatingSystem': np.int8,
    'Census_GenuineStateName': str,
    'Census_ActivationChannel': str,
    'Census_IsFlightingInternal': str,
    'Census_IsFlightsDisabled': str,
    'Census_FlightRing': str,
    'Census_ThresholdOptIn': str,
    'Census_FirmwareManufacturerIdentifier': str,
    'Census_FirmwareVersionIdentifier': str,
    'Census_IsSecureBootEnabled': np.int8,
    'Census_IsWIMBootEnabled': str,
    'Census_IsVirtualDevice': str,
    'Census_IsTouchEnabled': np.int8,
    'Census_IsPenCapable': np.int8,
    'Census_IsAlwaysOnAlwaysConnectedCapable': str,
    'Wdft_IsGamer': str,
    'Wdft_RegionIdentifier': str,
    'HasDetections': np.int8
}

full_features = pd.read_csv("train.csv", dtype=datatypes, index_col="MachineIdentifier")
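
The dtype dictionary above was written out by hand from a sample. A minimal sketch of how such a mapping could be derived, assuming a tiny in-memory CSV stands in for train.csv (the sample text and the `infer_dtypes` helper are hypothetical, not part of the original notebook):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of train.csv
sample_csv = io.StringIO(
    "MachineIdentifier,ProductName,IsBeta,Census_TotalPhysicalRAM\n"
    "a1,win8defender,0,4096.0\n"
    "b2,mse,0,8192.0\n"
    "c3,win8defender,1,\n"
)

def infer_dtypes(csv_source, nrows=1000):
    """Read a small sample and return a {column: dtype} dict usable with pd.read_csv."""
    sample = pd.read_csv(csv_source, nrows=nrows, index_col="MachineIdentifier")
    dtypes = {}
    for col, dtype in sample.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            # Small flags fit into int8; anything larger stays int64
            dtypes[col] = np.int8 if sample[col].abs().max() < 128 else np.int64
        elif pd.api.types.is_float_dtype(dtype):
            dtypes[col] = np.float64
        else:
            dtypes[col] = str
    return dtypes

inferred = infer_dtypes(sample_csv)
print(inferred)
```

A sample can mislead: a column with no missing values in the first rows may be inferred as an integer and then fail on the full file once a NaN appears, so the result is best treated as a starting point to review by hand.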

In [4]:
print (full_features.shape)

(180000, 82)


In [5]:
# Optional
# To speed things up, we shuffle the data and keep at most 200,000 rows; otherwise processing takes quite a while

# Shuffle the data

shuffle = np.random.permutation(np.arange(full_features.shape[0]))[:200000]
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]

print (full_features.shape)

(180000, 82)
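
Since the loaded file holds 180,000 rows, the 200,000-row cap above only shuffles the data and drops nothing. A small sketch of the same idea with `DataFrame.sample` and a fixed seed, assuming a toy frame in place of `full_features` (the `shuffle_and_cap` helper is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for full_features (assumption: 10 rows, string index)
toy = pd.DataFrame(
    {"HasDetections": np.arange(10) % 2},
    index=[f"m{i}" for i in range(10)],
)

def shuffle_and_cap(df, max_rows, seed=0):
    """Shuffle rows reproducibly and keep at most max_rows of them."""
    n = min(max_rows, len(df))
    return df.sample(n=n, random_state=seed)

capped = shuffle_and_cap(toy, max_rows=4)
uncapped = shuffle_and_cap(toy, max_rows=200000)
print(capped.shape, uncapped.shape)
```

Fixing `random_state` makes later accuracy numbers comparable between runs, which the unseeded permutation does not guarantee.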


In [6]:
# Checking the columns with the most NULL values
print((full_features.isnull().sum()).sort_values(ascending=False).head(20))

PuaMode                                  179953
Census_ProcessorClass                    179232
DefaultBrowsersIdentifier                171092
Census_IsFlightingInternal               149456
Census_InternalBatteryType               127786
Census_ThresholdOptIn                    114309
Census_IsWIMBootEnabled                  114145
SmartScreen                               64360
OrganizationIdentifier                    55367
SMode                                     10972
CityIdentifier                             6533
Wdft_IsGamer                               6255
Wdft_RegionIdentifier                      6255
Census_InternalBatteryNumberOfCharges      5382
Census_FirmwareManufacturerIdentifier      3704
Census_FirmwareVersionIdentifier           3237
Census_IsFlightsDisabled                   3201
Census_OEMModelIdentifier                  2065
Census_OEMNameIdentifier                   1926
Firewall                                   1826
dtype: int64
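
Absolute null counts are easier to compare as shares of the row count. A short sketch on a toy frame, assuming a 50% cut-off for "mostly empty" (the threshold and the toy values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SmartScreen": ["Off", np.nan, "RequireAdmin", "Warn"],
    "Platform": ["windows10"] * 4,
})

# Fraction of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)

# Columns that are missing in more than half of the rows (assumed threshold)
mostly_empty = null_share[null_share > 0.5].index.tolist()

print(null_share)
print(mostly_empty)
```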


In [7]:
full_features['PuaMode'].unique()

array([nan, 'on'], dtype=object)

In [8]:
full_features['Census_IsFlightingInternal'].unique()

array([nan, '0.0'], dtype=object)

In [9]:
full_features['Census_InternalBatteryType'].unique()

array([nan, 'lion', '#', 'li-i', 'unkn', 'lip', 'li p', 'liio', 'li',
       'ithi', 'real', 'nimh', 'lgi0', 'bq20', 'vbox', 'pbac', '4cel',
       'ca48', 'lit', 'lipo', 'lipp', 'batt', 'lhp0', 'asmb', 'icp3',
       '4lio', 'ram', 'li-p'], dtype=object)

In [10]:
full_features['Census_ThresholdOptIn'].unique()

array(['0.0', nan, '1.0'], dtype=object)

In [11]:
full_features['Census_IsWIMBootEnabled'].unique()

array(['0.0', nan], dtype=object)

In [12]:
full_features['SMode'].unique()

array(['0.0', nan, '1.0'], dtype=object)

In [13]:
full_features['OrganizationIdentifier'].unique()

array(['27.0', nan, '18.0', '50.0', '5.0', '37.0', '48.0', '11.0', '49.0',
       '46.0', '2.0', '32.0', '28.0', '20.0', '14.0', '4.0', '31.0',
       '45.0', '51.0', '33.0', '42.0', '10.0', '36.0', '6.0', '52.0',
       '21.0', '40.0', '8.0', '19.0', '12.0', '1.0', '29.0', '39.0',
       '44.0', '35.0', '47.0', '3.0', '22.0', '26.0', '16.0'],
      dtype=object)

In [14]:
full_features['Wdft_IsGamer'].unique()

array(['0.0', '1.0', nan], dtype=object)

In [15]:
full_features['Wdft_RegionIdentifier'].unique()

array(['15.0', '10.0', '11.0', '7.0', '3.0', '1.0', nan, '8.0', '12.0',
       '5.0', '6.0', '13.0', '4.0', '2.0', '9.0', '14.0'], dtype=object)

In [16]:
full_features['CityIdentifier'].unique()

array(['82905.0', '87568.0', '110905.0', ..., '39492.0', '142668.0',
       '10660.0'], dtype=object)

In [17]:
full_features['Census_InternalBatteryNumberOfCharges'].unique()

array([nan, '4294967295.0', '0.0', ..., '820.0', '54931.0', '828.0'],
      dtype=object)

In [18]:
# Cleaning up some data

# PuaMode - Potentially Unwanted Applications, if NA, then it is disabled. 99% are NA. So, better to drop it
# Census_ProcessorClass - According to the description - "No longer maintained and updated"
# DefaultBrowsersIdentifier - Almost all values are empty. Therefore we will drop this column
# Census_IsFlightingInternal - whether this is internal or "external" testing ring. Column mostly unused. Will have to drop it
# Census_InternalBatteryType - contains mostly garbage. Besides, it should not be relevant to attack surface.
# Census_ThresholdOptIn - also mostly unused. Googled it: Threshold was used in the first versions of Windows 10. Looks unused now
# Census_IsWIMBootEnabled - Is it possible to boot from Windows Image? Not relevant to identification of the attacks when 70% of data is empty
# SmartScreen - Whether smart screen in explorer is enabled. Should be important. "ExistsNotSet" when null, according to the description
# SMode - Quite relevant field. Will be keeping it
# OrganizationIdentifier - Attacks by organizations should be analyzed. If not filled, will assign "0". 
# Census_InternalBatteryNumberOfCharges - Not relevant. Will drop this column in order not to overtrain
# Census_OSSkuName -  OS edition friendly name (currently Windows only). - Can be removed. Duplicate field
# Census_ChassisTypeName - Census_MDC2FormFactor gives better information. Let's remove this field

#full_features['PuaMode'] = full_features['PuaMode'].fillna('off')
#full_features['SmartScreen'] = full_features['SmartScreen'].fillna('ExistsNotSet')
#full_features['SMode'] = full_features['SMode'].fillna('0').astype('int8')
#full_features['OrganizationIdentifier'] = full_features['OrganizationIdentifier'].fillna('0').astype('int32')
#full_features['Wdft_IsGamer'] = full_features['Wdft_IsGamer'].fillna('0').astype('int8')
#full_features['Wdft_RegionIdentifier'] = full_features['Wdft_RegionIdentifier'].fillna('0').astype('int32')
#full_features['CityIdentifier'] = full_features['CityIdentifier'].fillna('0').astype('int32')

#full_features = full_features.drop([
#    'PuaMode',
#    'Census_OSEdition',
#    'Census_ProcessorClass',
#    'DefaultBrowsersIdentifier',
#    'Census_IsFlightingInternal',
#    'Census_InternalBatteryType'], axis=1)
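
The commented-out lines above fill missing values with '0' and cast straight to an integer type. Because these columns were read as strings such as '27.0', a direct `astype('int32')` would, as far as I can tell, raise a ValueError. A hedged sketch that parses the strings first, on a toy frame (the `fill_numeric_string` helper is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SMode": ["0.0", np.nan, "1.0"],
    "OrganizationIdentifier": ["27.0", np.nan, "18.0"],
    "PuaMode": [np.nan, "on", np.nan],
})

def fill_numeric_string(series, fill=0, dtype="int32"):
    """Parse strings such as '27.0', replace missing values, and cast to an integer dtype."""
    return pd.to_numeric(series, errors="coerce").fillna(fill).astype(dtype)

# Drop a mostly empty column, then convert the ones we keep
cleaned = toy.drop(columns=["PuaMode"])
cleaned["SMode"] = fill_numeric_string(cleaned["SMode"], dtype="int8")
cleaned["OrganizationIdentifier"] = fill_numeric_string(cleaned["OrganizationIdentifier"])
print(cleaned)
```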

In [19]:
# Now let us check the string columns

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['ProductName',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'DefaultBrowsersIdentifier',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'CityIdentifier',
 'OrganizationIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'Platform',
 'Processor',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'IsProtected',
 'PuaMode',
 'SMode',
 'IeVerIdentifier',
 'SmartScreen',
 'Firewall',
 'UacLuaenable',
 'Census_MDC2FormFactor',
 'Census_DeviceFamily',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_ProcessorClass',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'C

In [20]:
full_features[string_columns].head(10)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_ProcessorClass,Census_PrimaryDiskTypeName,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
007e53a6444ec04701a3adeb333a640f,win8defender,1.1.15200.1,4.18.1807.18075,1.275.445.0,7.0,2388.0,53447.0,1.0,1.0,164,82905.0,27.0,205.0,172,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.1805...,Pro,1.0,,0.0,117.0,ExistsNotSet,1.0,1.0,Desktop,Windows.Desktop,1980.0,317708.0,4.0,5.0,2560.0,,SSD,Desktop,20.1,1680.0,1050.0,Desktop,,,10.0.16299.19,amd64,rs3_release,16299,19,Professional,PROFESSIONAL,UUPUpgrade,27.0,120,Notify,IS_GENUINE,Retail,,0.0,Retail,0.0,142.0,34833.0,0.0,0.0,0.0,0.0,15.0
1997e7ac97564c000d606a8416f86e33,win8defender,1.1.15200.1,4.18.1807.18075,1.275.913.0,7.0,,7945.0,2.0,1.0,208,87568.0,27.0,240.0,233,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,,0.0,137.0,,1.0,1.0,Desktop,Windows.Desktop,1980.0,317708.0,2.0,5.0,3439.0,,HDD,Desktop,43.0,1768.0,992.0,Desktop,,4294967295.0,10.0.17134.228,amd64,rs4_release,17134,228,Professional,PROFESSIONAL,UUPUpgrade,9.0,34,FullAuto,INVALID_LICENSE,Retail,,0.0,Retail,,142.0,34558.0,,0.0,0.0,1.0,10.0
5a6a183023c33bf9532c260938d0f4de,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1826.0,7.0,,53447.0,1.0,1.0,91,110905.0,,125.0,75,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.1805...,Home,1.0,,0.0,117.0,RequireAdmin,1.0,1.0,Notebook,Windows.Desktop,1443.0,256665.0,4.0,5.0,2867.0,,HDD,Portable,15.5,1366.0,768.0,Mobile,,0.0,10.0.16299.547,amd64,rs3_release_svc_escrow,16299,547,Core,CORE,Update,8.0,31,AutoInstallAndRebootAtMaintenanceTime,IS_GENUINE,Retail,,0.0,Retail,,355.0,19956.0,,0.0,0.0,0.0,11.0
fa46e48f81b29a8931685274a07c6712,win8defender,1.1.15200.1,4.13.17134.1,1.275.1632.0,7.0,,53447.0,1.0,1.0,203,143770.0,18.0,255.0,46,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,,0.0,137.0,ExistsNotSet,1.0,1.0,Desktop,Windows.Desktop,1980.0,187292.0,12.0,1.0,1266.0,,SSD,Desktop,22.0,1680.0,1050.0,Desktop,,4294967295.0,10.0.17134.286,amd64,rs4_release,17134,286,Professional,PROFESSIONAL,IBSClean,39.0,160,UNKNOWN,INVALID_LICENSE,Retail,,0.0,Retail,,142.0,34103.0,,0.0,0.0,0.0,7.0
bea3c5a849fa17b7405c061dfed6310e,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1718.0,7.0,,5439.0,3.0,1.0,220,26777.0,,237.0,72,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.1805...,Home,1.0,,0.0,117.0,RequireAdmin,1.0,1.0,Notebook,Windows.Desktop,525.0,331207.0,2.0,5.0,1998.0,,HDD,Notebook,15.5,1366.0,768.0,Mobile,lion,100.0,10.0.16299.611,amd64,rs3_release_svc_escrow,16299,611,CoreSingleLanguage,CORE_SINGLELANGUAGE,UUPUpgrade,8.0,31,Notify,IS_GENUINE,OEM:DM,,0.0,Retail,0.0,142.0,69888.0,0.0,0.0,0.0,0.0,11.0
8787d2add74a659b150793b178d047f4,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1860.0,7.0,,53447.0,1.0,1.0,110,7035.0,27.0,211.0,182,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,,0.0,137.0,,1.0,1.0,Notebook,Windows.Desktop,2668.0,34481.0,4.0,5.0,2257.0,,HDD,Notebook,11.4,800.0,600.0,Mobile,,0.0,10.0.17134.228,amd64,rs4_release,17134,228,Core,CORE,IBSClean,29.0,125,FullAuto,INVALID_LICENSE,Retail,,0.0,Retail,,628.0,9993.0,,0.0,0.0,0.0,3.0
8c0b60cdb5c58e219eb3d5c87dfbb5a0,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1107.0,7.0,,53447.0,1.0,1.0,108,75425.0,,140.0,75,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,,0.0,137.0,RequireAdmin,1.0,1.0,Desktop,Windows.Desktop,1443.0,275864.0,4.0,5.0,2624.0,,HDD,MiniTower,20.0,1600.0,900.0,Desktop,,4294967295.0,10.0.17134.228,amd64,rs4_release,17134,228,Professional,PROFESSIONAL,UUPUpgrade,8.0,31,FullAuto,IS_GENUINE,Retail,,0.0,Retail,,355.0,19948.0,,0.0,0.0,0.0,11.0
dc363d7018cf153417d819ca9014c405,win8defender,1.1.15200.1,4.18.1807.18075,1.275.497.0,0.0,,10501.0,4.0,2.0,124,29356.0,,277.0,75,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,,0.0,137.0,RequireAdmin,1.0,1.0,Desktop,Windows.Desktop,1443.0,275890.0,4.0,5.0,2273.0,,SSD,Desktop,19.4,1600.0,900.0,Desktop,,4294967295.0,10.0.17134.228,amd64,rs4_release,17134,228,Professional,PROFESSIONAL,UUPUpgrade,8.0,31,FullAuto,IS_GENUINE,Retail,,0.0,Retail,,355.0,19970.0,,0.0,0.0,0.0,3.0
1462359bd0df264b1f3c08dbeb249ebc,win8defender,1.1.14700.5,4.8.10240.16384,1.265.228.0,7.0,,53447.0,1.0,1.0,151,78182.0,27.0,277.0,75,windows10,x64,10.0.0.0,10240,256,th1,10240.16384.amd64fre.th1.150709-1700,Enterprise,1.0,,0.0,41.0,RequireAdmin,1.0,1.0,Notebook,Windows.Desktop,1443.0,256585.0,4.0,5.0,2321.0,,HDD,Portable,15.3,1366.0,768.0,Mobile,#,4294967295.0,10.0.10240.16384,amd64,th1,10240,16384,Enterprise,ENTERPRISE,IBSClean,8.0,31,UNKNOWN,IS_GENUINE,Volume:GVLK,0.0,0.0,NOT_SET,0.0,355.0,19948.0,0.0,0.0,0.0,0.0,1.0
dac5a4c9ed6813cdf58ccced6be81dcf,win8defender,1.1.15200.1,4.12.16299.15,1.275.1606.0,7.0,,53447.0,1.0,1.0,214,,50.0,277.0,75,windows10,x64,10.0.0.0,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1.0,,0.0,111.0,,1.0,1.0,Notebook,Windows.Desktop,2668.0,172108.0,8.0,5.0,2737.0,,HDD,Notebook,15.5,1920.0,1080.0,Mobile,,0.0,10.0.16299.15,amd64,rs3_release,16299,15,CoreSingleLanguage,CORE_SINGLELANGUAGE,IBSClean,8.0,31,UNKNOWN,IS_GENUINE,OEM:DM,,0.0,Retail,,628.0,15894.0,,0.0,0.0,0.0,1.0


At first glance, it is clear that the strings are either categorical labels or version strings made up of four dot-separated components. So, in order to use algorithms that support only numeric values, we will convert categorical columns like "ProductName" to an integer range, and version fields like AppVersion to their numeric components.
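
The version fields can be handled by splitting them on the dots. A minimal sketch, assuming every value has exactly four numeric parts and no missing entries (the `split_version` helper and the generated column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"AppVersion": ["4.18.1807.18075", "4.13.17134.1", "4.8.10240.16384"]}
)

def split_version(series, prefix):
    """Split 'a.b.c.d' strings into integer columns named prefix_0 .. prefix_3."""
    parts = series.str.split(".", expand=True).astype("int64")
    parts.columns = [f"{prefix}_{i}" for i in range(parts.shape[1])]
    return parts

version_parts = split_version(toy["AppVersion"], "AppVersion")
print(version_parts)
```

Keeping the parts as separate integers preserves their ordering (18 sorts after 8), which a plain string comparison of the full version would not.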

In [21]:
def df_replacevalues(df, colname, oldvalues, newvalues, topvalue):
    """Encode a categorical column in place: oldvalues[i] becomes newvalues[i]."""
    # topvalue is specified explicitly instead of taking the most frequent value:
    # topvalue = df[colname].value_counts().idxmax()

    # Replace NaN values with topvalue (assign back rather than calling
    # fillna(inplace=True) on a column selection, which may not modify df)
    df[colname] = df[colname].fillna(topvalue)

    # Find "garbage" rows whose value is not listed in oldvalues
    garbage = ~df[colname].isin(oldvalues)

    # If the "garbage" values make up more than 1% of the rows, raise an error
    if garbage.sum() > len(df) / 100:
        raise ValueError("Not all necessary values are present in oldvalues array")

    # Replace "garbage" with topvalue
    df.loc[garbage, colname] = topvalue

    print("Previous values", df[colname].unique())
    df[colname] = pd.to_numeric(df[colname].replace(oldvalues, newvalues), errors='raise', downcast='integer')
    print("New values", df[colname].unique())
    
#full_features["Platform"].unique()
#full_features["Platform"].value_counts()
#full_features[~full_features["ProductName"].isin(['win8defender', 'mse'])].index
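
The same idea can be written with a dictionary and `Series.map`, which keeps each label next to its code. This is an alternative sketch rather than the helper above; `map_categories`, its parameters, and the toy series are hypothetical:

```python
import numpy as np
import pandas as pd

def map_categories(series, mapping, default, max_garbage_share=0.01):
    """Map known labels to integers; NaN and unknown labels fall back to `default`."""
    filled = series.fillna(default)
    unknown = ~filled.isin(list(mapping))
    # Refuse to continue when too many values are missing from the mapping
    if unknown.mean() > max_garbage_share:
        raise ValueError(f"{unknown.sum()} values are missing from the mapping")
    return filled.where(~unknown, default).map(mapping).astype("int64")

toy = pd.Series(["win8defender"] * 198 + ["mse", np.nan], name="ProductName")
codes = map_categories(
    toy,
    {"win8defender": 1, "mse": 2, "mseprerelease": 2},
    default="win8defender",
)
print(codes.value_counts())
```

Returning a new series instead of modifying the frame in place makes the step easier to rerun in a notebook, at the cost of one extra assignment per column.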

With the standard converter, accuracy was only 57% when the feature values were left as strings.

Replacing all the string values in the following features with numbers and using OrdinalEncoder increased the accuracy by 5-10%.

- ProductName
- Platform
- Processor
- OsPlatformSubRelease
- SkuEdition
- SmartScreen
- Census_MDC2FormFactor
- Census_DeviceFamily
- Census_PrimaryDiskTypeName
- Census_ChassisTypeName
- Census_PowerPlatformRoleName
- Census_OSArchitecture
- Census_OSBranch
- Census_OSSkuName
- Census_OSInstallTypeName
- Census_OSWUAutoUpdateOptionsName
- Census_InternalBatteryType
- Census_GenuineStateName
- Census_ActivationChannel
- Census_FlightRing
- Census_OSEdition
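
For reference, a minimal sketch of scikit-learn's `OrdinalEncoder` on two of the listed columns, assuming toy values; how the encoder was actually configured in the original experiment is not shown, so the default settings here are an assumption:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "Processor": ["x64", "x86", "arm64", "x64"],
    "Census_PrimaryDiskTypeName": ["HDD", "SSD", "UNKNOWN", "HDD"],
})

# By default the categories are sorted per column and numbered from 0
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

print(encoder.categories_)
print(encoded)
```

The default codes follow alphabetical order, not any meaningful ranking, which is one reason to assign codes by hand when the order should carry information (for example OS releases).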

In [22]:
print(full_features["ProductName"].value_counts())

colname = "ProductName"
oldvalues = ['win8defender','mse','mseprerelease']
newvalues = [1,2,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'win8defender')

win8defender     178127
mse                1869
mseprerelease         4
Name: ProductName, dtype: int64
Previous values ['win8defender' 'mse' 'mseprerelease']
New values [1 2]


In [23]:
print(full_features["Platform"].value_counts())

colname = "Platform"
oldvalues = ['windows10','windows7','windows8','windows2016','Undefined']
newvalues = [10,7,8,2016,-1]

df_replacevalues(full_features, colname, oldvalues, newvalues,'Undefined')

windows10      173941
windows8         3910
windows7         1845
windows2016       304
Name: Platform, dtype: int64
Previous values ['windows10' 'windows7' 'windows8' 'windows2016']
New values [  10    7    8 2016]


In [24]:
print(full_features["Processor"].value_counts())

colname = "Processor"
oldvalues = ['x64','arm64','x86']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'x64')

x64      163599
x86       16394
arm64         7
Name: Processor, dtype: int64
Previous values ['x64' 'x86' 'arm64']
New values [1 3 2]


In [25]:
colname = "OsPlatformSubRelease"

print(full_features[colname].value_counts())

oldvalues = ['rs4','rs3','rs2','rs1','windows7','windows8.1','th1','th2','prers5','Unknown']
newvalues = [504,503,502,501,        407,408,                201,202,     505,     0]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Unknown')

rs4           78931
rs3           50578
rs2           15957
rs1           14773
th2            8119
th1            5462
windows8.1     3910
windows7       1845
prers5          425
Name: OsPlatformSubRelease, dtype: int64
Previous values ['rs3' 'rs4' 'th1' 'rs1' 'rs2' 'th2' 'windows7' 'windows8.1' 'prers5']
New values [503 504 201 501 502 202 407 408 505]


In [26]:
colname = "SkuEdition"

print(full_features[colname].value_counts())

oldvalues = ['Pro','Home','Invalid','Enterprise LTSB','Enterprise','Education','Cloud','Server']
newvalues = [55,52,0,71,70,20,90,80]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Invalid')

Home               110835
Pro                 65462
Invalid              1583
Education             812
Enterprise            696
Enterprise LTSB       423
Cloud                 120
Server                 69
Name: SkuEdition, dtype: int64
Previous values ['Pro' 'Home' 'Enterprise' 'Education' 'Invalid' 'Enterprise LTSB'
 'Server' 'Cloud']
New values [55 52 70 20  0 71 80 90]


In [27]:
colname = "SmartScreen"

print(full_features[colname].value_counts())

oldvalues = ['Off','off','OFF','On','on','Warn','Prompt','ExistsNotSet','Block','RequireAdmin']
newvalues = [0,0,0,1,1,2,3,4,5,6]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'ExistsNotSet')

RequireAdmin    86940
ExistsNotSet    21019
Off              3718
Warn             2754
Prompt            695
Block             446
off                36
On                 12
\x01                7
\x02                7
on                  6
Name: SmartScreen, dtype: int64
Previous values ['ExistsNotSet' 'RequireAdmin' 'Warn' 'Prompt' 'Off' 'Block' 'off' 'On'
 'on']
New values [4 6 2 3 0 5 1]
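
The case variants ('Off', 'off', 'OFF') can also be folded together by lower-casing before the lookup, so that each label is listed once. A sketch with the same codes as above, on a toy series; sending unrecognised values to the 'ExistsNotSet' code mirrors the cell above, and the variable names are hypothetical:

```python
import pandas as pd

raw = pd.Series(
    ["Off", "off", "OFF", "On", "on", "Warn", "\x01", None],
    name="SmartScreen",
)

codes_by_label = {
    "off": 0, "on": 1, "warn": 2, "prompt": 3,
    "existsnotset": 4, "block": 5, "requireadmin": 6,
}

# Missing values become 'ExistsNotSet', then everything is lower-cased
normalized = raw.fillna("ExistsNotSet").str.lower()

# Labels without a code (e.g. control characters) fall back to 'existsnotset'
smartscreen_codes = (
    normalized.map(codes_by_label)
    .fillna(codes_by_label["existsnotset"])
    .astype("int8")
)
print(smartscreen_codes.tolist())
```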


In [28]:
colname = "Census_MDC2FormFactor"

print(full_features[colname].value_counts())

oldvalues = ['Desktop','Notebook','Detachable','PCOther','AllInOne','Convertible','SmallTablet','LargeTablet','SmallServer','LargeServer','MediumServer','ServerOther','Other']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

Notebook        115207
Desktop          39669
Convertible       8109
Detachable        6078
AllInOne          5914
PCOther           2817
LargeTablet       1303
SmallTablet        621
SmallServer        186
MediumServer        75
LargeServer         19
ServerOther          2
Name: Census_MDC2FormFactor, dtype: int64
Previous values ['Desktop' 'Notebook' 'LargeTablet' 'Detachable' 'Convertible' 'AllInOne'
 'PCOther' 'SmallTablet' 'SmallServer' 'LargeServer' 'MediumServer'
 'ServerOther']
New values [ 1  2  8  3  6  5  4  7  9 10 11 12]


In [29]:
# Census_DeviceFamily ['Windows.Desktop' 'Windows.Server' 'Windows']

colname = "Census_DeviceFamily"

print(full_features[colname].value_counts())

oldvalues = ['Windows.Desktop','Windows.Server','Windows']
#newvalues = [i+1 for i in range(len(oldvalues))]
# Windows = Windows.Desktop
newvalues = [1,2,1]
    
df_replacevalues(full_features, colname, oldvalues, newvalues, 'Windows.Desktop')

Windows.Desktop    179696
Windows.Server        304
Name: Census_DeviceFamily, dtype: int64
Previous values ['Windows.Desktop' 'Windows.Server']
New values [1 2]


In [30]:
# Census_PrimaryDiskTypeName ['HDD' 'SSD' 'UNKNOWN' 'Unspecified' nan]

colname = "Census_PrimaryDiskTypeName"

print(full_features[colname].value_counts())

oldvalues = ['HDD','SSD','UNKNOWN','Unspecified']
newvalues = [1,2,3,3]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Unspecified')

HDD            117029
SSD             49818
UNKNOWN          7337
Unspecified      5580
Name: Census_PrimaryDiskTypeName, dtype: int64
Previous values ['SSD' 'HDD' 'UNKNOWN' 'Unspecified']
New values [2 1 3]


In [31]:
# Census_ChassisTypeName Index(['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 'RackMountChassis', 'SubNotebook', 'BusExpansionChassis', '30', 'StickPC', '0', 'MultisystemChassis', 'Blade', '35', 'PizzaBox', 'SealedCasePC', 'SubChassis', 'ExpansionChassis', '31', '32', '88', '127', '25', '44', '36', 'DockingStation', 'BladeEnclosure', 'CompactPCI', '81', '45', 'EmbeddedPC', '28', '82', '112', 'IoTGateway', '49', '76', '39'], dtype='object')

colname = "Census_ChassisTypeName"

print(full_features[colname].value_counts())

oldvalues = ['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 
             'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 
             'RackMountChassis', 'SubNotebook', 'BusExpansionChassis']
# Grouping Laptop/Notebook, unknown and other
newvalues = [1,2,1,3,4,5,6,0,0,7,
             8,9,10,11,12,0,13,14,15,
             16,1,17]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

Notebook               105645
Desktop                 37860
Laptop                  13871
Portable                 7208
AllinOne                 4190
Convertible              1734
MiniTower                1691
Other                    1569
UNKNOWN                  1367
Detachable               1072
LowProfileDesktop        1066
HandHeld                  907
SpaceSaving               570
Tablet                    274
Tower                     240
Unknown                   222
MainServerChassis         201
MiniPC                     98
LunchBox                   89
RackMountChassis           73
SubNotebook                15
BusExpansionChassis         9
30                          5
0                           2
88                          2
MultisystemChassis          2
SealedCasePC                2
35                          1
32                          1
PizzaBox                    1
Name: Census_ChassisTypeName, dtype: int64
Previous values ['Desktop' 'Portable' 'Notebook' 'MiniTow
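
Columns with a long tail of rare labels, such as the numeric chassis codes above, can have that tail folded into one bucket before encoding. A sketch on toy data, assuming a count threshold chosen by hand (the `group_rare` helper and `min_count` value are hypothetical):

```python
import pandas as pd

chassis = pd.Series(
    ["Notebook"] * 6 + ["Desktop"] * 3 + ["30", "PizzaBox"],
    name="Census_ChassisTypeName",
)

def group_rare(series, min_count, other="UNKNOWN"):
    """Replace labels seen fewer than min_count times with a single bucket."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

grouped = group_rare(chassis, min_count=2)
print(grouped.value_counts())
```

Unlike the 1% garbage check in `df_replacevalues`, this never raises, so it is worth printing the value counts before and after to confirm that only the intended labels were merged.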

In [32]:
# Census_PowerPlatformRoleName Index(['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']

colname = "Census_PowerPlatformRoleName"

print(full_features[colname].value_counts())

oldvalues = ['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']
newvalues = [1,2,3,4,5,0,6,7,8,0]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

Mobile               124402
Desktop               41981
Slate                  9975
Workstation            2174
SOHOServer              790
UNKNOWN                 445
EnterpriseServer        148
AppliancePC              83
PerformanceServer         1
Name: Census_PowerPlatformRoleName, dtype: int64
Previous values ['Desktop' 'Mobile' 'Slate' 'Workstation' 'SOHOServer' 'EnterpriseServer'
 'UNKNOWN' 'AppliancePC' 'PerformanceServer']
New values [2 1 3 4 5 6 0 7 8]


In [33]:
# Census_OSArchitecture Index(['amd64', 'x86', 'arm64'], dtype='object')

colname = "Census_OSArchitecture"

print(full_features[colname].value_counts())

oldvalues = ['amd64', 'x86', 'arm64']
newvalues = [1,3,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'amd64')

amd64    163583
x86       16410
arm64         7
Name: Census_OSArchitecture, dtype: int64
Previous values ['amd64' 'x86' 'arm64']
New values [1 3 2]


In [34]:
# Census_OSBranch Index(['rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 'rs2_release', 'rs1_release', 'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'rs5_release', 'rs3_release_svc_escrow_im', 'rs_prerelease', 'rs_prerelease_flt', 'rs5_release_sigma', 'rs1_release_srvmedia', 'winblue_ltsb_escrow', 'win7sp1_ldr', 'winblue_ltsb', 'win8_gdr', 'rs_xbox', 'rs5_release_edge', 'rs5_release_sigma_dev', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'rs_shell', 'rs1_release_svc', 'win8_ldr', 'rs_onecore_base_cobalt', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs3_release_svc', 'Khmer OS'], dtype='object')

colname = "Census_OSBranch"

print(full_features[colname].value_counts())

oldvalues = ['rs5_release', 'rs5_release_sigma', 'rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 
             'rs3_release_svc_escrow_im', 'rs2_release', 'rs1_release', 'rs_prerelease', 'rs_prerelease_flt', 
             'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'Undefined']
newvalues = [25,25,24,23,23,23,22,21,20,20,
             12,12,11,11,0]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Undefined')

rs4_release                  80826
rs3_release                  25207
rs3_release_svc_escrow       23996
rs2_release                  16309
rs1_release                  15886
th2_release                   6511
th2_release_sec               5215
th1_st1                       3975
th1                           1514
rs5_release                    297
rs3_release_svc_escrow_im      126
rs_prerelease                   69
rs_prerelease_flt               68
rs1_release_srvmedia             1
Name: Census_OSBranch, dtype: int64
Previous values ['rs3_release' 'rs4_release' 'rs3_release_svc_escrow' 'th1' 'rs1_release'
 'rs2_release' 'th2_release' 'th1_st1' 'th2_release_sec' 'rs5_release'
 'rs3_release_svc_escrow_im' 'rs_prerelease_flt' 'rs_prerelease'
 'Undefined']
New values [23 24 11 21 22 12 25 20  0]


In [35]:
# Census_OSSkuName Index(['CORE', 'PROFESSIONAL', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'EDUCATION', 'ENTERPRISE', 'PROFESSIONAL_N', 'ENTERPRISE_S', 'STANDARD_SERVER', 'CLOUD', 'CORE_N', 'STANDARD_EVALUATION_SERVER', 'EDUCATION_N', 'ENTERPRISE_S_N', 'DATACENTER_EVALUATION_SERVER', 'SB_SOLUTION_SERVER', 'ENTERPRISE_N', 'PRO_WORKSTATION', 'UNLICENSED', 'DATACENTER_SERVER', 'PRO_WORKSTATION_N', 'CLOUDN', 'PRO_CHINA', 'SERVERRDSH', 'ULTIMATE', 'PRO_FOR_EDUCATION', 'PRO_SINGLE_LANGUAGE', 'UNDEFINED', 'STARTER', 'ENTERPRISEG'], dtype='object')

colname = "Census_OSSkuName"
oldvalues = ['CORE', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'CORE_N',
             'EDUCATION', 'EDUCATION_N',
             'PROFESSIONAL', 'PROFESSIONAL_N', 'PRO_WORKSTATION',
             'ENTERPRISE',  'ENTERPRISE_S', 'ENTERPRISE_S_N', 'ENTERPRISE_N', 
             'CLOUD',
             'SB_SOLUTION_SERVER', 'STANDARD_SERVER', 'STANDARD_EVALUATION_SERVER', 'DATACENTER_EVALUATION_SERVER', 'UNLICENSED']
newvalues = [i+1 for i in range(len(oldvalues))]

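# Group this feature by values
# Caution: `'CORE' in series` tests the Series index (the MachineIdentifier labels), not the
# SKU strings, so each assignment below writes one scalar to every row. In the head() output
# later in this notebook all seven columns are 0. A per-row flag would need a vectorized
# test such as full_features[colname].str.contains('CORE').
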
full_features['CORE'] = 1 if 'CORE' in full_features['Census_OSSkuName'] else 0
full_features['EDUCATION'] = 1 if 'EDUCATION' in full_features['Census_OSSkuName'] else 0
full_features['PRO'] = 1 if 'PRO' in full_features['Census_OSSkuName'] else 0
full_features['ENTERPRISE'] = 1 if 'ENTERPRISE' in full_features['Census_OSSkuName'] else 0
full_features['CLOUD'] = 1 if 'CLOUD' in full_features['Census_OSSkuName'] else 0
full_features['SERVER'] = 1 if 'SERVER' in full_features['Census_OSSkuName'] else 0
full_features['EVALUATION'] = 1 if 'EVALUATION' in full_features['Census_OSSkuName'] else 0

full_features.drop([colname], axis=1, inplace=True)
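The `in` operator applied to a pandas Series checks index labels rather than values, so a per-row indicator needs a vectorized string test. The sketch below assumes the intent of the cell above is one 0/1 column per SKU family; the `sku` Series is made-up sample data, and `flags` is a hypothetical name.

```python
import pandas as pd

# Small made-up sample standing in for full_features['Census_OSSkuName'].
sku = pd.Series(
    ["CORE", "PROFESSIONAL_N", "ENTERPRISE_S", "STANDARD_EVALUATION_SERVER", None],
    name="Census_OSSkuName",
)

keywords = ["CORE", "EDUCATION", "PRO", "ENTERPRISE", "CLOUD", "SERVER", "EVALUATION"]

flags = pd.DataFrame({
    # regex=False matches the keyword literally; na=False counts a missing SKU as "no match".
    kw: sku.str.contains(kw, regex=False, na=False).astype("int8")
    for kw in keywords
})
print(flags)
```

Substring matching means a SKU can set several flags at once (here `STANDARD_EVALUATION_SERVER` sets both `SERVER` and `EVALUATION`), which may or may not be the grouping that was intended.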


In [36]:
# Census_OSInstallTypeName Index(['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh'], dtype='object')

colname = "Census_OSInstallTypeName"

print(full_features[colname].value_counts())

oldvalues = ['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

UUPUpgrade        52670
IBSClean          33381
Update            32067
Upgrade           25038
Other             17038
Reset             13083
Refresh            4155
Clean              1466
CleanPCRefresh     1102
Name: Census_OSInstallTypeName, dtype: int64
Previous values ['UUPUpgrade' 'Update' 'IBSClean' 'Upgrade' 'Other' 'Reset' 'Refresh'
 'Clean' 'CleanPCRefresh']
New values [1 3 2 4 5 6 7 8 9]


In [37]:
# Census_OSWUAutoUpdateOptionsName Index(['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify'], dtype='object')

colname = "Census_OSWUAutoUpdateOptionsName"

print(full_features[colname].value_counts())

oldvalues = ['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

FullAuto                                 79875
UNKNOWN                                  50985
Notify                                   40899
AutoInstallAndRebootAtMaintenanceTime     7392
Off                                        545
DownloadNotify                             304
Name: Census_OSWUAutoUpdateOptionsName, dtype: int64
Previous values ['Notify' 'FullAuto' 'AutoInstallAndRebootAtMaintenanceTime' 'UNKNOWN'
 'DownloadNotify' 'Off']
New values [3 1 4 2 6 5]


In [38]:
colname = "Census_InternalBatteryType"

print(full_features[colname].value_counts())

oldvalues = ['lion', 'li-i', '#', 'lip', 'unkn']
newvalues = [1,1,1,1,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'unkn')

lion    40923
li-i     4920
#        3820
lip      1261
liio      643
li p      176
li        153
nimh       96
bq20       57
real       55
pbac       41
vbox       24
unkn        9
lgi0        9
lipo        5
4cel        5
lhp0        2
lipp        2
lit         2
batt        2
ithi        2
asmb        2
li-p        1
4lio        1
icp3        1
ram         1
ca48        1
Name: Census_InternalBatteryType, dtype: int64
Previous values ['unkn' 'lion' '#' 'li-i' 'lip']
New values [2 1]


In [39]:
# Census_GenuineStateName Index(['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED'], dtype='object')

colname = "Census_GenuineStateName"

print(full_features[colname].value_counts())

oldvalues = ['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED']
newvalues = [1,2,3,4,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

IS_GENUINE         158645
INVALID_LICENSE     16257
OFFLINE              4802
UNKNOWN               296
Name: Census_GenuineStateName, dtype: int64
Previous values ['IS_GENUINE' 'INVALID_LICENSE' 'OFFLINE' 'UNKNOWN']
New values [1 2 3 4]


In [40]:
# Census_ActivationChannel Index(['Retail', 'OEM:DM', 'Volume:GVLK', 'OEM:NONSLP', 'Volume:MAK', 'Retail:TB:Eval'], dtype='object')

#Assigning separate values for Retail, OEM and Volume channels
colname = "Census_ActivationChannel"

print(full_features[colname].value_counts())

oldvalues = ['Retail', 'Retail:TB:Eval', 'OEM:DM', 'OEM:NONSLP', 'Volume:GVLK', 'Volume:MAK', 'Other']
#newvalues = [i+1 for i in range(len(oldvalues))]
newvalues = [1,1,2,2,3,3,4]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

Retail            95693
OEM:DM            68412
Volume:GVLK        9255
OEM:NONSLP         6409
Volume:MAK          162
Retail:TB:Eval       69
Name: Census_ActivationChannel, dtype: int64
Previous values ['Retail' 'OEM:DM' 'Volume:GVLK' 'OEM:NONSLP' 'Volume:MAK'
 'Retail:TB:Eval']
New values [1 2 3]
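Because the channel family is the text before the first `:`, the same Retail / OEM / Volume grouping can also be derived from the label itself instead of listing every variant. This is an alternative sketch rather than what the cell above does; `channels`, `family` and `codes` are hypothetical names, and the code 4 for anything else mirrors the `Other` entry above.

```python
import pandas as pd

channels = pd.Series(
    ["Retail", "OEM:DM", "Volume:GVLK", "OEM:NONSLP", "Volume:MAK", "Retail:TB:Eval", None]
)

# Keep only the part before the first ':' ("OEM:DM" -> "OEM").
family = channels.str.split(":").str[0]

# Unknown or missing families map to NaN and are then assigned the "Other" code 4.
codes = family.map({"Retail": 1, "OEM": 2, "Volume": 3}).fillna(4).astype("int8")
print(codes.tolist())
```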


In [41]:
full_features['Census_FlightRing'].value_counts()

Retail      168638
NOT_SET       5831
Unknown       4827
WIS            248
RP             189
WIF            182
Disabled        85
Name: Census_FlightRing, dtype: int64

In [42]:
# Census_FlightRing Index(['Retail', 'NOT_SET', 'Unknown', 'WIS', 'WIF', 'RP', 'Disabled', 'OSG', 'Canary', 'Invalid', 'CBCanary'], dtype='object')

colname = "Census_FlightRing"

print(full_features[colname].value_counts())

oldvalues = ['Retail', 'NOT_SET', 'Disabled', 'Unknown']
newvalues = [1,2,2,3]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Unknown')

Retail      168638
NOT_SET       5831
Unknown       4827
WIS            248
RP             189
WIF            182
Disabled        85
Name: Census_FlightRing, dtype: int64
Previous values ['Retail' 'NOT_SET' 'Unknown' 'Disabled']
New values [1 2 3]


In [43]:
# Census_OSEdition: group editions into Core (1), Professional (2), Education (3), Enterprise (4), Server (5), Cloud (6) and Other (7)

colname = "Census_OSEdition"

print(full_features[colname].value_counts())

oldvalues = ['Core','CoreSingleLanguage','CoreCountrySpecific','CoreN',
             'Professional','ProfessionalN','ProfessionalEducation','ProfessionalEducationN',
             'Education','EducationN',
             'Enterprise','EnterpriseS','EnterpriseSN','EnterpriseN',
             'ServerStandard','ServerStandardEval','ServerDatacenterEval','ServerSolution',
             'Cloud',
             'Other']
newvalues = [1,1,1,1,
             2,2,2,2,
             3,3,
             4,4,4,4,
             5,5,5,5,
             6,
             7]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

Core                       69816
Professional               63575
CoreSingleLanguage         39013
CoreCountrySpecific         3360
ProfessionalEducation       1111
Education                    822
Enterprise                   713
ProfessionalN                580
EnterpriseS                  408
ServerStandard               222
Cloud                        141
CoreN                        102
ServerStandardEval            52
EducationN                    18
EnterpriseSN                  18
ServerDatacenterEval          17
EnterpriseN                   15
ServerSolution                13
ProfessionalEducationN         2
CloudN                         1
ProfessionalWorkstation        1
Name: Census_OSEdition, dtype: int64
Previous values ['Professional' 'Core' 'CoreSingleLanguage' 'Enterprise'
 'ProfessionalEducation' 'ProfessionalN' 'CoreCountrySpecific' 'Education'
 'ServerStandard' 'CoreN' 'EnterpriseS' 'ServerStandardEval' 'Cloud'
 'EnterpriseN' 'EnterpriseSN' 'ServerDatacenterEval' 

In [44]:
# PuaMode Index(['off', 'on', 'audit'], dtype='object')

#colname = "PuaMode"

#print(full_features[colname].value_counts())

#oldvalues = ['off', 'on', 'audit']
#newvalues = [0,1,2]

#df_replacevalues(full_features, colname, oldvalues, newvalues)

full_features.drop(['PuaMode','Census_ProcessorClass','DefaultBrowsersIdentifier'], axis=1, inplace=True)

In [45]:
# Now let us check the string columns again

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'CityIdentifier',
 'OrganizationIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsBuildLab',
 'IsProtected',
 'SMode',
 'IeVerIdentifier',
 'Firewall',
 'UacLuaenable',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_InternalBatteryNumberOfCharges',
 'Census_OSVersion',
 'Census_OSBuildNumber',
 'Census_OSBuildRevision',
 'Census_OSInstallLanguageIdentifier',
 'Census_OSUILocaleIdentifier',
 'Census_IsFlightingInternal',
 'Census_IsFlightsDisabled',
 'Census_ThresholdOptIn',
 'Cens
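The loop in the cell above collects every column whose dtype is `object`. pandas offers the same selection directly through `DataFrame.select_dtypes`; the sketch below shows it on a small made-up frame (`demo` is a hypothetical name, and the columns are forced to `object` only to make the example deterministic).

```python
import pandas as pd

demo = pd.DataFrame({
    "OsBuild": [16299, 17134],                               # integer column
    "OsVer": pd.Series(["10.0.0.0", "10.0.0.0"], dtype=object),  # string column
    "Firewall": [1.0, None],                                 # float column with a missing value
})

# Same result as looping over the dtypes and keeping the "object" columns.
string_columns = demo.select_dtypes(include="object").columns.tolist()
print(string_columns)
```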

In [46]:
full_features[string_columns].head(10)

MachineIdentifier,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,OsVer,OsBuild,OsSuite,OsBuildLab,IsProtected,SMode,IeVerIdentifier,Firewall,UacLuaenable,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
007e53a6444ec04701a3adeb333a640f,1.1.15200.1,4.18.1807.18075,1.275.445.0,7.0,53447.0,1.0,1.0,164,82905.0,27.0,205.0,172,10.0.0.0,16299,256,16299.431.amd64fre.rs3_release_svc_escrow.1805...,1.0,0.0,117.0,1.0,1.0,1980.0,317708.0,4.0,5.0,2560.0,20.1,1680.0,1050.0,,10.0.16299.19,16299,19,27.0,120,,0.0,0.0,142.0,34833.0,0.0,0.0,0.0,0.0,15.0
1997e7ac97564c000d606a8416f86e33,1.1.15200.1,4.18.1807.18075,1.275.913.0,7.0,7945.0,2.0,1.0,208,87568.0,27.0,240.0,233,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1.0,0.0,137.0,1.0,1.0,1980.0,317708.0,2.0,5.0,3439.0,43.0,1768.0,992.0,4294967295.0,10.0.17134.228,17134,228,9.0,34,,0.0,,142.0,34558.0,,0.0,0.0,1.0,10.0
5a6a183023c33bf9532c260938d0f4de,1.1.15100.1,4.18.1807.18075,1.273.1826.0,7.0,53447.0,1.0,1.0,91,110905.0,,125.0,75,10.0.0.0,16299,768,16299.431.amd64fre.rs3_release_svc_escrow.1805...,1.0,0.0,117.0,1.0,1.0,1443.0,256665.0,4.0,5.0,2867.0,15.5,1366.0,768.0,0.0,10.0.16299.547,16299,547,8.0,31,,0.0,,355.0,19956.0,,0.0,0.0,0.0,11.0
fa46e48f81b29a8931685274a07c6712,1.1.15200.1,4.13.17134.1,1.275.1632.0,7.0,53447.0,1.0,1.0,203,143770.0,18.0,255.0,46,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1.0,0.0,137.0,1.0,1.0,1980.0,187292.0,12.0,1.0,1266.0,22.0,1680.0,1050.0,4294967295.0,10.0.17134.286,17134,286,39.0,160,,0.0,,142.0,34103.0,,0.0,0.0,0.0,7.0
bea3c5a849fa17b7405c061dfed6310e,1.1.15200.1,4.18.1807.18075,1.275.1718.0,7.0,5439.0,3.0,1.0,220,26777.0,,237.0,72,10.0.0.0,16299,768,16299.431.amd64fre.rs3_release_svc_escrow.1805...,1.0,0.0,117.0,1.0,1.0,525.0,331207.0,2.0,5.0,1998.0,15.5,1366.0,768.0,100.0,10.0.16299.611,16299,611,8.0,31,,0.0,0.0,142.0,69888.0,0.0,0.0,0.0,0.0,11.0
8787d2add74a659b150793b178d047f4,1.1.15100.1,4.18.1807.18075,1.273.1860.0,7.0,53447.0,1.0,1.0,110,7035.0,27.0,211.0,182,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1.0,0.0,137.0,1.0,1.0,2668.0,34481.0,4.0,5.0,2257.0,11.4,800.0,600.0,0.0,10.0.17134.228,17134,228,29.0,125,,0.0,,628.0,9993.0,,0.0,0.0,0.0,3.0
8c0b60cdb5c58e219eb3d5c87dfbb5a0,1.1.15200.1,4.18.1807.18075,1.275.1107.0,7.0,53447.0,1.0,1.0,108,75425.0,,140.0,75,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1.0,0.0,137.0,1.0,1.0,1443.0,275864.0,4.0,5.0,2624.0,20.0,1600.0,900.0,4294967295.0,10.0.17134.228,17134,228,8.0,31,,0.0,,355.0,19948.0,,0.0,0.0,0.0,11.0
dc363d7018cf153417d819ca9014c405,1.1.15200.1,4.18.1807.18075,1.275.497.0,0.0,10501.0,4.0,2.0,124,29356.0,,277.0,75,10.0.0.0,17134,256,17134.1.amd64fre.rs4_release.180410-1804,1.0,0.0,137.0,1.0,1.0,1443.0,275890.0,4.0,5.0,2273.0,19.4,1600.0,900.0,4294967295.0,10.0.17134.228,17134,228,8.0,31,,0.0,,355.0,19970.0,,0.0,0.0,0.0,3.0
1462359bd0df264b1f3c08dbeb249ebc,1.1.14700.5,4.8.10240.16384,1.265.228.0,7.0,53447.0,1.0,1.0,151,78182.0,27.0,277.0,75,10.0.0.0,10240,256,10240.16384.amd64fre.th1.150709-1700,1.0,0.0,41.0,1.0,1.0,1443.0,256585.0,4.0,5.0,2321.0,15.3,1366.0,768.0,4294967295.0,10.0.10240.16384,10240,16384,8.0,31,0.0,0.0,0.0,355.0,19948.0,0.0,0.0,0.0,0.0,1.0
dac5a4c9ed6813cdf58ccced6be81dcf,1.1.15200.1,4.12.16299.15,1.275.1606.0,7.0,53447.0,1.0,1.0,214,,50.0,277.0,75,10.0.0.0,16299,768,16299.15.amd64fre.rs3_release.170928-1534,1.0,0.0,111.0,1.0,1.0,2668.0,172108.0,8.0,5.0,2737.0,15.5,1920.0,1080.0,0.0,10.0.16299.15,16299,15,8.0,31,,0.0,,628.0,15894.0,,0.0,0.0,0.0,1.0


In [47]:
# Now we need to process the columns that contain version numbers
# We will split each of them into 4 to 6 separate columns (OsBuildLab yields 6 parts)

versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']
newcolumnnames = []

for colname in versions:
    data = full_features[colname].str.split(r"\.|-",expand=True) # Split if '.' or '-'
    for i in range(data.shape[1]):
        newcolumnname = "%s_%d" % (colname, i+1)
        newcolumnnames.append(newcolumnname)
        full_features[newcolumnname] = data[i]
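To make the splitting step concrete, here is a minimal sketch on two `OsBuildLab` strings taken from the table above. It assumes nothing beyond pandas; `demo`, `parts` and `numeric_cols` are hypothetical names, and the numeric conversion at the end is an extra step that the notebook performs later for all columns at once.

```python
import pandas as pd

demo = pd.DataFrame({
    "OsBuildLab": [
        "17134.1.amd64fre.rs4_release.180410-1804",
        "10240.16384.amd64fre.th1.150709-1700",
    ]
})

# Split on '.' or '-'; a multi-character pattern is treated as a regular expression.
parts = demo["OsBuildLab"].str.split(r"[.-]", expand=True)
parts.columns = [f"OsBuildLab_{i + 1}" for i in range(parts.shape[1])]

# Parts 3 and 4 are labels (architecture and branch); the others are numbers.
numeric_cols = ["OsBuildLab_1", "OsBuildLab_2", "OsBuildLab_5", "OsBuildLab_6"]
for col in numeric_cols:
    parts[col] = pd.to_numeric(parts[col], errors="coerce")

print(parts)
```

Rows with fewer separators than the longest value get `None` in the trailing columns, which is why the later NaN handling matters for these derived features.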

In [48]:
full_features[newcolumnnames].head(10)

MachineIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
007e53a6444ec04701a3adeb333a640f,1,1,15200,1,4,18,1807,18075,1,275,445,0,10,0,0,0,16299,431,amd64fre,rs3_release_svc_escrow,180502,1908,10,0,16299,19
1997e7ac97564c000d606a8416f86e33,1,1,15200,1,4,18,1807,18075,1,275,913,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
5a6a183023c33bf9532c260938d0f4de,1,1,15100,1,4,18,1807,18075,1,273,1826,0,10,0,0,0,16299,431,amd64fre,rs3_release_svc_escrow,180502,1908,10,0,16299,547
fa46e48f81b29a8931685274a07c6712,1,1,15200,1,4,13,17134,1,1,275,1632,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,286
bea3c5a849fa17b7405c061dfed6310e,1,1,15200,1,4,18,1807,18075,1,275,1718,0,10,0,0,0,16299,431,amd64fre,rs3_release_svc_escrow,180502,1908,10,0,16299,611
8787d2add74a659b150793b178d047f4,1,1,15100,1,4,18,1807,18075,1,273,1860,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
8c0b60cdb5c58e219eb3d5c87dfbb5a0,1,1,15200,1,4,18,1807,18075,1,275,1107,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
dc363d7018cf153417d819ca9014c405,1,1,15200,1,4,18,1807,18075,1,275,497,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
1462359bd0df264b1f3c08dbeb249ebc,1,1,14700,5,4,8,10240,16384,1,265,228,0,10,0,0,0,10240,16384,amd64fre,th1,150709,1700,10,0,10240,16384
dac5a4c9ed6813cdf58ccced6be81dcf,1,1,15200,1,4,12,16299,15,1,275,1606,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,15


In [49]:
#colname = "OsBuildLab_4"
#print (full_features[colname].value_counts())
#print (colname, full_features[colname].value_counts().keys())

In [50]:
# After splitting the columns, the only values we need to remap are OsBuildLab_3 and OsBuildLab_4
# Other values are already numeric

# OsBuildLab_3 Index(['amd64fre', 'x86fre', 'arm64fre'], dtype='object')

colname = "OsBuildLab_3"
oldvalues = ['amd64fre', 'x86fre', 'arm64fre']
newvalues = [1,3,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'amd64fre')

Previous values ['amd64fre' 'x86fre' 'arm64fre']
New values [1 3 2]


In [51]:
# OsBuildLab_4 Index(['rs4_release', 'rs3_release_svc_escrow', 'rs3_release', 'rs2_release', 'rs1_release', 'th2_release_sec', 'th1', 'winblue_ltsb_escrow', 'th2_release', 'rs1_release_inmarket', 'winblue_ltsb', 'win7sp1_ldr', 'rs3_release_svc', 'rs1_release_1', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'th1_st1', 'rs5_release', 'rs1_release_inmarket_aim', 'rs3_release_svc_escrow_im', 'th2_release_inmarket', 'rs_prerelease', 'rs_prerelease_flt', 'win7sp1_gdr', 'winblue_gdr', 'th1_escrow', 'win7_gdr', 'winblue_r4', 'rs1_release_inmarket_rim', 'rs1_release_d', 'winblue_r9', 'winblue_r5', 'win7_rtm', 'win7sp1_rtm', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'rs5_release_sigma', 'win7_ldr', 'rs5_release_sigma_dev', 'rs_xbox', 'rs5_release_edge', 'winblue_rtm', 'win7sp1_rc', 'rs3_release_svc_sec', 'rs_onecore_base_cobalt', 'rs6_prerelease', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs_shell']

colname = "OsBuildLab_4"
oldvalues = ['rs6_prerelease',
             'rs5_release', 'rs5_release_sigma', 'rs5_release_sigma_dev', 'rs5_release_edge', 'rs5_release_sign',
             'rs4_release', 
             'rs3_release_svc_escrow', 'rs3_release', 'rs3_release_svc', 'rs3_release_svc_escrow_im', 'rs3_release_svc_sec', 
             'rs2_release', 
             'rs1_release', 'rs1_release_inmarket', 'rs1_release_1', 'rs1_release_sec', 'rs1_release_inmarket_aim', 'rs1_release_inmarket_rim', 'rs1_release_d', 
             'rs_prerelease', 'rs_prerelease_flt',
             'th2_release_sec', 'th2_release', 'th2_release_inmarket', 
             'th1', 'th1_st1', 'th1_escrow', 
             'winblue_ltsb_escrow', 'winblue_ltsb', 'winblue_gdr', 'winblue_r4', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'winblue_r9', 'winblue_r5', 'winblue_rtm',
             'win7sp1_ldr', 'win7sp1_ldr_escrow', 'win7sp1_gdr', 'win7_gdr', 'win7_rtm', 'win7sp1_rtm', 'win7_ldr', 'win7sp1_rc', 
             'rs_xbox', 'rs_onecore_base_cobalt', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs_shell',
             'other']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'other')

Previous values ['rs3_release_svc_escrow' 'rs4_release' 'th1' 'rs3_release'
 'rs1_release_inmarket' 'rs2_release' 'th2_release_sec'
 'win7sp1_ldr_escrow' 'rs1_release_1' 'winblue_ltsb_escrow' 'th2_release'
 'rs1_release' 'rs1_release_sec' 'th1_st1' 'winblue_ltsb' 'rs1_release_d'
 'rs3_release_svc' 'th2_release_inmarket' 'rs5_release' 'win7sp1_ldr'
 'winblue_r9' 'win7_gdr' 'rs3_release_svc_escrow_im'
 'rs1_release_inmarket_aim' 'rs1_release_inmarket_rim' 'rs_prerelease_flt'
 'win7sp1_gdr' 'rs_prerelease' 'winblue_r7' 'winblue_r3' 'winblue_r4'
 'winblue_gdr' 'other' 'winblue_r5' 'th1_escrow' 'win7sp1_rtm' 'win7_ldr'
 'winblue_r8']
New values [ 8  7 26  9 15 13 23 40 16 29 24 14 17 27 30 20 10 25  2 39 36 42 11 18
 19 22 41 21 33 34 32 31 52 37 28 44 45 35]


In [52]:
# Version 1.2.3.4 was converted to columns

# Version_1 = 1
# Version_2 = 2
# Version_3 = 3
# Version_4 = 4
# So the original version columns are no longer needed

versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']

full_features = full_features.drop(versions, axis=1)

In [53]:
# Convert every column that is not already a small integer type to numeric and replace NaN/NULL values with -1

for colname in full_features.columns:
    if full_features[colname].dtypes.name not in ["int8","int16","int32"]:
        #topvalue = full_features[colname].value_counts().idxmax()
        topvalue = -1
        # Values that cannot be parsed become NaN (errors='coerce') and are then filled with -1.
        # Assigning the result back avoids calling fillna(inplace=True) on a column selection,
        # which newer pandas versions warn about.
        full_features[colname] = pd.to_numeric(full_features[colname], errors='coerce').fillna(topvalue)
        

In [54]:
full_features.head(10)

MachineIdentifier,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,CORE,EDUCATION,PRO,ENTERPRISE,CLOUD,SERVER,EVALUATION,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
007e53a6444ec04701a3adeb333a640f,1,0,7.0,0,53447.0,1.0,1.0,1,164,82905.0,27.0,205.0,172,10,1,16299,256,503,55,1.0,0,0.0,117.0,4,1.0,1.0,1,1,1980.0,317708.0,4.0,5.0,2560.0,114473.0,2,113906.0,1,8192.0,2,20.1,1680.0,1050.0,2,2,-1.0,1,23,16299,19,2,1,27.0,120,3,0,1,1,-1.0,0.0,1,0.0,142.0,34833.0,0,0.0,0.0,0,0,0.0,0.0,15.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,445,0,10,0,0,0,16299,431,1,8,180502,1908,10,0,16299,19
1997e7ac97564c000d606a8416f86e33,1,0,7.0,0,7945.0,2.0,1.0,1,208,87568.0,27.0,240.0,233,10,1,17134,256,504,55,1.0,0,0.0,137.0,4,1.0,1.0,1,1,1980.0,317708.0,2.0,5.0,3439.0,476940.0,1,241228.0,0,4096.0,2,43.0,1768.0,992.0,2,2,4294967000.0,1,24,17134,228,2,1,9.0,34,1,0,2,1,-1.0,0.0,1,-1.0,142.0,34558.0,0,-1.0,0.0,0,0,0.0,1.0,10.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,913,0,10,0,0,0,17134,1,1,7,180410,1804,10,0,17134,228
5a6a183023c33bf9532c260938d0f4de,1,0,7.0,0,53447.0,1.0,1.0,1,91,110905.0,-1.0,125.0,75,10,1,16299,768,503,52,1.0,0,0.0,117.0,6,1.0,1.0,2,1,1443.0,256665.0,4.0,5.0,2867.0,953869.0,1,939083.0,0,8192.0,3,15.5,1366.0,768.0,1,2,0.0,1,23,16299,547,1,3,8.0,31,4,0,1,1,-1.0,0.0,1,-1.0,355.0,19956.0,0,-1.0,0.0,1,0,0.0,0.0,11.0,0,0,0,0,0,0,0,0,1,1,15100,1,4,18,1807,18075,1,273,1826,0,10,0,0,0,16299,431,1,8,180502,1908,10,0,16299,547
fa46e48f81b29a8931685274a07c6712,1,0,7.0,0,53447.0,1.0,1.0,1,203,143770.0,18.0,255.0,46,10,1,17134,256,504,55,1.0,0,0.0,137.0,4,1.0,1.0,1,1,1980.0,187292.0,12.0,1.0,1266.0,228936.0,2,228434.0,0,8192.0,2,22.0,1680.0,1050.0,2,2,4294967000.0,1,24,17134,286,2,2,39.0,160,2,0,2,1,-1.0,0.0,1,-1.0,142.0,34103.0,0,-1.0,0.0,0,0,0.0,0.0,7.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,13,17134,1,1,275,1632,0,10,0,0,0,17134,1,1,7,180410,1804,10,0,17134,286
bea3c5a849fa17b7405c061dfed6310e,1,0,7.0,0,5439.0,3.0,1.0,1,220,26777.0,-1.0,237.0,72,10,1,16299,768,503,52,1.0,0,0.0,117.0,6,1.0,1.0,2,1,525.0,331207.0,2.0,5.0,1998.0,476940.0,1,476164.0,0,2048.0,1,15.5,1366.0,768.0,1,1,100.0,1,23,16299,611,1,1,8.0,31,3,0,1,2,-1.0,0.0,1,0.0,142.0,69888.0,1,0.0,0.0,0,0,0.0,0.0,11.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,1718,0,10,0,0,0,16299,431,1,8,180502,1908,10,0,16299,611
8787d2add74a659b150793b178d047f4,1,0,7.0,0,53447.0,1.0,1.0,1,110,7035.0,27.0,211.0,182,10,1,17134,768,504,52,1.0,0,0.0,137.0,4,1.0,1.0,2,1,2668.0,34481.0,4.0,5.0,2257.0,476940.0,1,149900.0,0,4096.0,1,11.4,800.0,600.0,1,2,0.0,1,24,17134,228,1,2,29.0,125,1,0,2,1,-1.0,0.0,1,-1.0,628.0,9993.0,0,-1.0,0.0,0,0,0.0,0.0,3.0,1,0,0,0,0,0,0,0,1,1,15100,1,4,18,1807,18075,1,273,1860,0,10,0,0,0,17134,1,1,7,180410,1804,10,0,17134,228
8c0b60cdb5c58e219eb3d5c87dfbb5a0,1,0,7.0,0,53447.0,1.0,1.0,1,108,75425.0,-1.0,140.0,75,10,1,17134,256,504,55,1.0,0,0.0,137.0,6,1.0,1.0,1,1,1443.0,275864.0,4.0,5.0,2624.0,953869.0,1,944600.0,0,8192.0,5,20.0,1600.0,900.0,2,2,4294967000.0,1,24,17134,228,2,1,8.0,31,1,0,1,1,-1.0,0.0,1,-1.0,355.0,19948.0,1,-1.0,0.0,0,0,0.0,0.0,11.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,1107,0,10,0,0,0,17134,1,1,7,180410,1804,10,0,17134,228
dc363d7018cf153417d819ca9014c405,1,0,0.0,1,10501.0,4.0,2.0,1,124,29356.0,-1.0,277.0,75,10,1,17134,256,504,55,1.0,0,0.0,137.0,6,1.0,1.0,1,1,1443.0,275890.0,4.0,5.0,2273.0,114473.0,2,113857.0,1,8192.0,2,19.4,1600.0,900.0,2,2,4294967000.0,1,24,17134,228,2,1,8.0,31,1,0,1,1,-1.0,0.0,1,-1.0,355.0,19970.0,0,-1.0,0.0,0,0,0.0,0.0,3.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,497,0,10,0,0,0,17134,1,1,7,180410,1804,10,0,17134,228
1462359bd0df264b1f3c08dbeb249ebc,1,0,7.0,0,53447.0,1.0,1.0,1,151,78182.0,27.0,277.0,75,10,1,10240,256,201,70,1.0,0,0.0,41.0,6,1.0,1.0,2,1,1443.0,256585.0,4.0,5.0,2321.0,476940.0,1,99500.0,0,2048.0,3,15.3,1366.0,768.0,1,1,4294967000.0,1,11,10240,16384,4,2,8.0,31,2,0,1,3,0.0,0.0,2,0.0,355.0,19948.0,0,0.0,0.0,0,0,0.0,0.0,1.0,0,0,0,0,0,0,0,0,1,1,14700,5,4,8,10240,16384,1,265,228,0,10,0,0,0,10240,16384,1,26,150709,1700,10,0,10240,16384
dac5a4c9ed6813cdf58ccced6be81dcf,1,0,7.0,0,53447.0,1.0,1.0,1,214,-1.0,50.0,277.0,75,10,1,16299,768,503,52,1.0,0,0.0,111.0,4,1.0,1.0,2,1,2668.0,172108.0,8.0,5.0,2737.0,953869.0,1,251254.0,0,8192.0,1,15.5,1920.0,1080.0,1,2,0.0,1,23,16299,15,1,2,8.0,31,2,0,1,2,-1.0,0.0,1,-1.0,628.0,15894.0,0,-1.0,0.0,0,0,0.0,0.0,1.0,1,0,0,0,0,0,0,0,1,1,15200,1,4,12,16299,15,1,275,1606,0,10,0,0,0,16299,15,1,9,170928,1534,10,0,16299,15


In [55]:
# Let's see some details of the loaded data
full_features.describe()

Unnamed: 0,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,CORE,EDUCATION,PRO,ENTERPRISE,CLOUD,SERVER,EVALUATION,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
count,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0,180000.0
mean,1.010406,2.2e-05,6.817367,0.017333,47675.479406,1.315689,1.01145,0.988206,107.874467,78400.258767,16.91865,169.586194,122.508811,13.313717,1.182194,15726.3461,573.991194,477.402028,52.639683,0.937339,3.3e-05,-0.060567,125.659606,4.850294,0.958217,0.991989,2.196072,1.001689,2196.474717,236284.098133,3.970533,4.505772,2360.734972,509376.8,1.422911,373997.4,0.077533,6056.931422,1.638467,16.602685,1540.032506,893.355828,1.402289,1.717089,1097317000.0,1.182372,22.100283,15840.1502,973.759728,1.401906,2.945578,14.506017,60.4548,1.881439,0.000628,1.148606,1.520306,-0.830311,-0.017783,1.093378,-0.634961,395.181656,32369.515422,0.484244,-0.634139,0.005667,0.126572,0.038567,0.049178,0.238561,7.580189,0.499789,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15074.383833,1.297344,4.0,15.862961,5670.104667,14084.158161,1.0,272.352033,931.7827,0.0,9.872111,0.075417,0.010433,0.0,15726.250906,1413.429422,1.182183,10.7101,176489.5081,1776.780411,10.0,0.0,15840.1502,973.756072
std,0.101476,0.004714,1.125738,0.13051,14321.942773,0.541026,0.209569,0.10796,62.923918,50457.417036,12.828057,89.314179,69.332278,82.373453,0.575462,2180.095486,248.377477,80.562367,5.884556,0.258804,0.005773,0.240159,43.569029,1.269305,0.245615,0.18792,1.316782,0.041061,1327.099917,75847.890684,2.121441,1.337697,854.010302,372272.7,0.624663,324430.3,0.267437,5067.012155,1.559532,6.037045,385.547283,224.816804,0.715464,0.450415,1873193000.0,0.575715,3.517571,1952.62618,2934.202749,0.569602,1.821584,10.253234,45.034499,0.935657,0.025048,0.435598,0.595166,0.37536,0.132163,0.381012,0.481627,226.943832,21445.413497,0.499753,0.481672,0.095109,0.332494,0.19256,0.250097,0.50115,4.753896,0.500001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,278.417532,1.026592,0.0,3.320319,6271.846393,7384.924796,0.0,6.038691,533.168794,0.0,0.703706,0.447285,0.109504,0.0,2180.4081,4596.2857,0.575447,6.305566,6048.864411,210.333102,0.0,0.0,1952.62618,2934.203313
min,1.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,-1.0,-1.0,-1.0,1.0,7.0,1.0,7600.0,16.0,201.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,-1.0,1.0,0.0,7601.0,0.0,1.0,1.0,-1.0,2.0,1.0,0.0,1.0,1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,11502.0,0.0,4.0,4.0,204.0,0.0,1.0,195.0,0.0,0.0,6.0,0.0,0.0,0.0,-1.0,-1.0,1.0,2.0,-1.0,-1.0,10.0,0.0,7601.0,0.0
25%,1.0,0.0,7.0,0.0,49480.0,1.0,1.0,1.0,51.0,31368.0,-1.0,89.0,74.0,10.0,1.0,15063.0,256.0,502.0,52.0,1.0,0.0,0.0,108.0,4.0,1.0,1.0,2.0,1.0,1443.0,189547.0,2.0,5.0,1998.0,238475.0,1.0,120284.5,0.0,4096.0,1.0,13.9,1366.0,768.0,1.0,1.0,0.0,1.0,22.0,15063.0,167.0,1.0,1.0,8.0,31.0,1.0,0.0,1.0,1.0,-1.0,0.0,1.0,-1.0,142.0,12463.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15100.0,1.0,4.0,13.0,1807.0,17443.0,1.0,273.0,491.0,0.0,10.0,0.0,0.0,0.0,15063.0,1.0,1.0,7.0,170928.0,1804.0,10.0,0.0,15063.0,167.0
50%,1.0,0.0,7.0,0.0,53447.0,1.0,1.0,1.0,97.0,77866.0,18.0,181.0,88.0,10.0,1.0,16299.0,768.0,503.0,52.0,1.0,0.0,0.0,117.0,4.0,1.0,1.0,2.0,1.0,2102.0,246001.0,4.0,5.0,2500.0,476940.0,1.0,245496.5,0.0,4096.0,1.0,15.5,1366.0,768.0,1.0,2.0,0.0,1.0,23.0,16299.0,285.0,1.0,3.0,9.0,34.0,2.0,0.0,1.0,1.0,-1.0,0.0,1.0,-1.0,488.0,33054.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15100.0,1.0,4.0,18.0,1807.0,18075.0,1.0,273.0,941.0,0.0,10.0,0.0,0.0,0.0,16299.0,1.0,1.0,8.0,180410.0,1804.0,10.0,0.0,16299.0,285.0
75%,1.0,0.0,7.0,0.0,53447.0,2.0,1.0,1.0,162.0,121599.25,27.0,267.0,182.0,10.0,1.0,17134.0,768.0,504.0,55.0,1.0,0.0,0.0,137.0,6.0,1.0,1.0,2.0,1.0,2668.0,302378.75,4.0,5.0,2867.0,953869.0,2.0,475959.0,0.0,8192.0,2.0,17.2,1920.0,1080.0,2.0,2.0,4294967000.0,1.0,24.0,17134.0,547.0,2.0,4.0,20.0,90.0,3.0,0.0,1.0,2.0,-1.0,0.0,1.0,0.0,556.0,52173.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15200.0,1.0,4.0,18.0,10586.0,18075.0,1.0,275.0,1379.0,0.0,10.0,0.0,0.0,0.0,17134.0,431.0,1.0,13.0,180410.0,1834.0,10.0,0.0,17134.0,547.0
max,2.0,1.0,8.0,1.0,70492.0,6.0,4.0,1.0,222.0,167953.0,52.0,296.0,278.0,2016.0,3.0,18242.0,784.0,505.0,90.0,1.0,1.0,1.0,429.0,6.0,1.0,48.0,12.0,2.0,6141.0,345480.0,144.0,10.0,4472.0,45785070.0,3.0,13349830.0,1.0,671744.0,17.0,142.0,5760.0,3840.0,8.0,2.0,4294967000.0,3.0,25.0,18242.0,17976.0,7.0,9.0,39.0,162.0,6.0,1.0,4.0,3.0,0.0,0.0,3.0,1.0,1083.0,72080.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15300.0,6.0,4.0,18.0,17686.0,20063.0,1.0,277.0,4320.0,0.0,10.0,3.0,16.0,0.0,18242.0,24236.0,3.0,52.0,180914.0,2340.0,10.0,0.0,18242.0,17976.0
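
The `min` row shows -1 in many columns, and `Census_InternalBatteryNumberOfCharges` has a maximum of about 4294967000, which is close to 2^32 - 1. Both look like placeholders for missing values rather than real measurements, although nothing in this notebook confirms that. A minimal sketch of how such placeholders could be turned into NaN before summarizing, using a small made-up frame (the toy values and the `SENTINELS` list are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values only mimic the pattern seen in describe()
toy = pd.DataFrame({
    "Census_InternalBatteryNumberOfCharges": [0.0, 100.0, 4294967295.0, -1.0],
    "Census_TotalPhysicalRAM": [4096.0, -1.0, 8192.0, 2048.0],
})

# Assumed placeholder values; verify them against the real data before use
SENTINELS = [-1.0, 4294967295.0]

# Replace every placeholder with NaN in all columns
cleaned = toy.replace(SENTINELS, np.nan)

# describe() skips NaN, so the mean and max are no longer distorted
print(cleaned.describe().loc[["count", "mean", "max"]])
```

With the placeholders masked, the mean of the battery column reflects only the real readings instead of being dominated by the huge sentinel.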


In [56]:
full_features['UacLuaenable'].unique()

array([ 1.,  0., -1., 48.,  2.])

In [57]:
full_features.to_csv('regularized.csv')

In [58]:
full_features.dtypes

ProductName                                             int8
IsBeta                                                  int8
RtpStateBitfield                                     float64
IsSxsPassiveMode                                        int8
AVProductStatesIdentifier                            float64
AVProductsInstalled                                  float64
AVProductsEnabled                                    float64
HasTpm                                                  int8
CountryIdentifier                                      int64
CityIdentifier                                       float64
OrganizationIdentifier                               float64
GeoNameIdentifier                                    float64
LocaleEnglishNameIdentifier                            int64
Platform                                               int16
Processor                                               int8
OsBuild                                                int64
OsSuite                 

In [59]:
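# Shuffle the rows. No random seed is set here, so the resulting
# train/test split (and the scores below) will vary between runs.
shuffle = np.random.permutation(np.arange(full_features.shape[0]))
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]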
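
If a repeatable row order is wanted, one option is `DataFrame.sample` with a fixed `random_state`. This is only a sketch on a made-up frame, not what the cell above does; the frame contents and the seed value 123 are arbitrary choices for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for full_features
frame = pd.DataFrame({"feature": np.arange(10), "HasDetections": [0, 1] * 5})

# frac=1 returns every row in random order; the seed makes the order repeatable
shuffled_a = frame.sample(frac=1, random_state=123)
shuffled_b = frame.sample(frac=1, random_state=123)

# Same seed gives the same row order, and no rows are lost or duplicated
print(shuffled_a.index.equals(shuffled_b.index))
print(sorted(shuffled_a.index) == list(frame.index))
```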

In [60]:
full_labels = full_features["HasDetections"]

# Drop the label column so that it cannot leak into the features
full_features = full_features.drop(["HasDetections"], axis=1)

In [61]:
# Prepare train and test features and labels (80/20 split of the shuffled rows)
train_count = int(len(full_features) * 0.8)

train_features = full_features.values[:train_count]
test_features  = full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]
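
The manual slicing above relies on the earlier shuffle. `train_test_split`, which is already imported at the top of the notebook, can shuffle and split in one call and can also keep the class balance equal in both parts. A minimal sketch on synthetic arrays (`X`, `y`, and the seed are assumptions made for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # hypothetical feature matrix
y = np.array([0, 1] * 500)       # hypothetical balanced labels

# stratify=y keeps the share of positive labels identical in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```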

In [62]:
train_features.shape

(144000, 104)

In [63]:
test_features.shape

(36000, 104)

In [64]:
# Standardize the features; the scaler is fitted on the training rows only,
# so no test-set statistics leak into training
scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

# Baseline on all 104 columns; score() returns the mean accuracy on the test rows
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
all_columns_score = clf.score(normalized_test_features, test_labels)

print("All columns (normalized)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)

All columns (normalized) (144000, 104) (36000, 104) (144000,) (36000,) HistGradientBoostingClassifier 64.28055555555555
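
The number printed above is accuracy, because `score` on a scikit-learn classifier returns mean accuracy. Since the task is stated in terms of a predicted probability, a ranking metric such as ROC AUC computed from `predict_proba` may also be worth tracking. The sketch below uses synthetic data and `LogisticRegression` as a stand-in model purely to keep it small; it is not the notebook's model, and the data-generating rule is made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

model = LogisticRegression().fit(X_train, y_train)

# Accuracy uses hard 0/1 predictions; AUC uses the predicted probability of class 1
accuracy = accuracy_score(y_test, model.predict(X_test))
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

print(f"accuracy={accuracy:.3f} auc={auc:.3f}")
```

The same two calls should work with any fitted classifier that exposes `predict_proba`.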


In [65]:
bruteforced_columns = ['ProductName', 'IsBeta', 'RtpStateBitfield', 'IsSxsPassiveMode',
       'AVProductStatesIdentifier', 'AVProductsInstalled', 'AVProductsEnabled',
       'CountryIdentifier', 'CityIdentifier', 'OrganizationIdentifier',
       'GeoNameIdentifier', 'LocaleEnglishNameIdentifier', 'Platform',
       'Processor', 'OsSuite', 'OsPlatformSubRelease', 'SkuEdition',
       'IsProtected', 'AutoSampleOptIn', 'SMode', 'IeVerIdentifier',
       'SmartScreen', 'Firewall', 'UacLuaenable', 'Census_MDC2FormFactor',
       'Census_DeviceFamily', 'Census_OEMNameIdentifier',
       'Census_ProcessorManufacturerIdentifier',
       'Census_ProcessorModelIdentifier', 'Census_PrimaryDiskTotalCapacity',
       'Census_PrimaryDiskTypeName', 'Census_SystemVolumeTotalCapacity',
       'Census_HasOpticalDiskDrive', 'Census_TotalPhysicalRAM',
       'Census_ChassisTypeName',
       'Census_InternalPrimaryDiagonalDisplaySizeInInches',
       'Census_InternalPrimaryDisplayResolutionHorizontal',
       'Census_InternalPrimaryDisplayResolutionVertical',
       'Census_PowerPlatformRoleName', 'Census_InternalBatteryNumberOfCharges',
       'Census_OSArchitecture', 'Census_OSBranch', 'Census_OSBuildNumber',
       'Census_OSBuildRevision', 'Census_OSEdition',
       'Census_OSInstallTypeName', 'Census_OSInstallLanguageIdentifier',
       'Census_OSUILocaleIdentifier', 'Census_OSWUAutoUpdateOptionsName',
       'Census_IsPortableOperatingSystem', 'Census_GenuineStateName',
       'Census_ActivationChannel', 'Census_IsFlightsDisabled',
       'Census_FlightRing', 'Census_ThresholdOptIn',
       'Census_FirmwareManufacturerIdentifier',
       'Census_FirmwareVersionIdentifier', 'Census_IsSecureBootEnabled',
       'Census_IsWIMBootEnabled', 'Census_IsVirtualDevice',
       'Census_IsTouchEnabled', 'Census_IsPenCapable',
       'Census_IsAlwaysOnAlwaysConnectedCapable', 'Wdft_IsGamer',
       'Wdft_RegionIdentifier', 'EngineVersion_1', 'EngineVersion_2',
       'EngineVersion_3', 'EngineVersion_4', 'AppVersion_1', 'AppVersion_2',
       'AppVersion_3', 'AppVersion_4', 'AvSigVersion_1', 'AvSigVersion_2',
       'AvSigVersion_3', 'AvSigVersion_4', 'OsVer_1', 'OsVer_2', 'OsVer_3',
       'OsVer_4', 'OsBuildLab_1', 'OsBuildLab_2', 'OsBuildLab_3',
       'OsBuildLab_4', 'OsBuildLab_5', 'OsBuildLab_6', 'Census_OSVersion_1',
       'Census_OSVersion_2', 'Census_OSVersion_3', 'Census_OSVersion_4',
       'CORE', 'EDUCATION', 'PRO', 'ENTERPRISE', 'CLOUD', 'SERVER', 'EVALUATION']

bruteforced_train_features = full_features[bruteforced_columns].values[:train_count]
bruteforced_test_features  = full_features[bruteforced_columns].values[train_count:]

print ("Bruteforced", bruteforced_train_features.shape, bruteforced_test_features.shape, train_labels.shape, test_labels.shape)

Bruteforced (144000, 98) (36000, 98) (144000,) (36000,)


In [66]:
# Run HistGradientBoostingClassifier on Bruteforced training and test data

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(bruteforced_train_features, train_labels)
bruteforced_columns_score = clf.score(bruteforced_test_features, test_labels)
    
print ("Bruteforced", bruteforced_train_features.shape, bruteforced_test_features.shape, 
       train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", bruteforced_columns_score*100)

Bruteforced (144000, 98) (36000, 98) (144000,) (36000,) HistGradientBoostingClassifier 64.21944444444443


In [67]:
full_features[bruteforced_columns].dtypes

ProductName                                             int8
IsBeta                                                  int8
RtpStateBitfield                                     float64
IsSxsPassiveMode                                        int8
AVProductStatesIdentifier                            float64
AVProductsInstalled                                  float64
AVProductsEnabled                                    float64
CountryIdentifier                                      int64
CityIdentifier                                       float64
OrganizationIdentifier                               float64
GeoNameIdentifier                                    float64
LocaleEnglishNameIdentifier                            int64
Platform                                               int16
Processor                                               int8
OsSuite                                                int64
OsPlatformSubRelease                                   int16
SkuEdition              

In [68]:
# Copy the list; a plain assignment would alias bruteforced_columns,
# and the extend() below would then modify both lists
engineered_columns = list(bruteforced_columns)

full_features['ScreenProportion'] = full_features['Census_InternalPrimaryDisplayResolutionHorizontal'] / full_features['Census_InternalPrimaryDisplayResolutionVertical']
full_features['ScreenDimensions'] = (full_features['Census_InternalPrimaryDisplayResolutionHorizontal'] * 10000) + full_features['Census_InternalPrimaryDisplayResolutionVertical']
# Ratio and difference of the system volume capacity relative to the primary disk capacity
full_features['CapacityRatio'] = full_features['Census_SystemVolumeTotalCapacity'] / full_features['Census_PrimaryDiskTotalCapacity']
full_features['CapacityDifference'] = full_features['Census_SystemVolumeTotalCapacity'] - full_features['Census_PrimaryDiskTotalCapacity']
full_features['RAMByCores'] = full_features['Census_TotalPhysicalRAM'] / full_features['Census_ProcessorCoreCount']

# Division by zero yields inf and 0/0 yields NaN; map both to the -1 placeholder
new_columns = ['ScreenProportion', 'ScreenDimensions', 'CapacityRatio', 'CapacityDifference', 'RAMByCores']
for column in new_columns:
    full_features[column] = full_features[column].replace([np.inf, -np.inf], np.nan).fillna(-1)

engineered_columns.extend(new_columns)

engineered_train_features = full_features[engineered_columns].values[:train_count]
engineered_test_features  = full_features[engineered_columns].values[train_count:]

print ("Engineered", engineered_train_features.shape, engineered_test_features.shape, train_labels.shape, test_labels.shape)

Engineered (144000, 103) (36000, 103) (144000,) (36000,)
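
One caveat with these ratio features: the source columns appear to use -1 as a placeholder, so a plain division can produce values such as `-1 / -1 = 1` that look valid but are not. A hedged sketch of a division helper that masks such rows first; `safe_ratio` is a hypothetical name, and treating every non-positive value as missing is an assumption that may not suit all columns:

```python
import pandas as pd

def safe_ratio(numerator, denominator, missing=-1):
    """Divide two Series, returning `missing` wherever either input is non-positive."""
    num = numerator.where(numerator > 0)      # non-positive values become NaN
    den = denominator.where(denominator > 0)  # this also removes zero denominators
    return (num / den).fillna(missing)

# Hypothetical values: one clean row, then a missing denominator,
# a missing numerator, and a zero denominator
ram = pd.Series([8192.0, 4096.0, -1.0, 2048.0])
cores = pd.Series([4.0, -1.0, 2.0, 0.0])

ram_by_cores = safe_ratio(ram, cores)
print(ram_by_cores.tolist())
```

Only the first row yields a real ratio; the other three are marked as missing instead of producing misleading numbers or infinities.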


In [69]:
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(engineered_train_features, train_labels)
engineered_columns_score = clf.score(engineered_test_features, test_labels)
    
print ("Engineered", engineered_train_features.shape, engineered_test_features.shape, 
       train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", engineered_columns_score*100)

Engineered (144000, 103) (36000, 103) (144000,) (36000,) HistGradientBoostingClassifier 64.225
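
The three accuracies reported so far (64.28, 64.22 and 64.23) differ by less than 0.1 percentage points. A rough way to judge whether that gap means anything is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n), with n = 36000 test rows. This is only an approximation: the three models are evaluated on the same rows, so a paired comparison would be more precise.

```python
import math

n_test = 36000
# Accuracies copied from the printed outputs above
scores = {
    "all columns (normalized)": 0.6428055555555555,
    "bruteforced": 0.6421944444444443,
    "engineered": 0.64225,
}

for name, p in scores.items():
    se = math.sqrt(p * (1 - p) / n_test)  # binomial standard error of the accuracy
    print(f"{name}: {p * 100:.2f}% +/- {se * 100:.2f} points (one standard error)")

# Largest gap between any two of the three scores
spread = max(scores.values()) - min(scores.values())
print(f"spread between best and worst: {spread * 100:.2f} points")
```

One standard error is about 0.25 points, while the spread is about 0.06 points, so these results do not separate the three feature sets.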


In [70]:
# Re-attach the label so that it is saved together with the selected features
full_features["HasDetections"] = full_labels
engineered_columns.append("HasDetections")
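
Whether `engineered_columns` is an alias or a copy of `bruteforced_columns` matters here, because `extend` and `append` modify a list in place: with an alias, the engineered names and the label would also end up in `bruteforced_columns`. A small illustration of the difference, with made-up list contents:

```python
base = ["a", "b"]

alias = base          # second name for the same list object
alias.append("c")
print(base)           # base has changed as well

copy = list(base)     # independent shallow copy
copy.append("d")
print(base)           # base is unaffected this time
print(copy)
```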

In [71]:
full_features[engineered_columns].to_csv('bruteforced_engineered.csv')

In [72]:
full_features[engineered_columns].dtypes

ProductName                                             int8
IsBeta                                                  int8
RtpStateBitfield                                     float64
IsSxsPassiveMode                                        int8
AVProductStatesIdentifier                            float64
AVProductsInstalled                                  float64
AVProductsEnabled                                    float64
CountryIdentifier                                      int64
CityIdentifier                                       float64
OrganizationIdentifier                               float64
GeoNameIdentifier                                    float64
LocaleEnglishNameIdentifier                            int64
Platform                                               int16
Processor                                               int8
OsSuite                                                int64
OsPlatformSubRelease                                   int16
SkuEdition              