The goal of this competition is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. HasDetections is the ground truth and indicates that Malware was detected on the machine. Using the information and labels in train.csv, you must predict the value for HasDetections for each machine in test.csv.

The sampling methodology used to create this dataset was designed to meet certain business constraints, with regard to both user privacy and the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is made complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above may lead to imperfect agreement between your cross validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines.

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from sklearn.experimental import enable_hist_gradient_boosting
import sklearn.ensemble as ske
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

In [2]:
# set up display area to show dataframe in jupyter qtconsole
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [3]:
# We need to specify data types explicitly when reading the CSV; otherwise loading is very memory-consuming
# and pandas emits the warning "Specify dtype option on import or set low_memory=False".
# So we define the data types manually.

# P.S. I loaded a sample of the data and exported train_data.dtypes;
# these are the data types for fast loading

datatypes = {
    'ProductName': str,
    'EngineVersion': str,
    'AppVersion': str,
    'AvSigVersion': str,
    'IsBeta': np.int8,
    'RtpStateBitfield': str,
    'IsSxsPassiveMode': np.int8,
    'DefaultBrowsersIdentifier': str,
    'AVProductStatesIdentifier': str,
    'AVProductsInstalled': str,
    'AVProductsEnabled': str,
    'HasTpm': np.int8,
    'CountryIdentifier': str,
    'CityIdentifier': str,
    'OrganizationIdentifier': str,
    'GeoNameIdentifier': str,
    'LocaleEnglishNameIdentifier': str,
    'Platform': str,
    'Processor': str,
    'OsVer': str,
    'OsBuild': str,
    'OsSuite': str,
    'OsPlatformSubRelease': str,
    'OsBuildLab': str,
    'SkuEdition': str,
    'IsProtected': str,
    'AutoSampleOptIn': np.int8,
    'PuaMode': str,
    'SMode': str,
    'IeVerIdentifier': str,
    'SmartScreen': str,
    'Firewall': str,
    'UacLuaenable': str,
    'Census_MDC2FormFactor': str,
    'Census_DeviceFamily': str,
    'Census_OEMNameIdentifier': str,
    'Census_OEMModelIdentifier': str, 
    'Census_ProcessorCoreCount': str,
    'Census_ProcessorManufacturerIdentifier': str,
    'Census_ProcessorModelIdentifier': str,
    'Census_ProcessorClass': str,
    'Census_PrimaryDiskTotalCapacity': np.float64,
    'Census_PrimaryDiskTypeName': str,
    'Census_SystemVolumeTotalCapacity': np.float64,
    'Census_HasOpticalDiskDrive': np.int8,
    'Census_TotalPhysicalRAM': np.float64,
    'Census_ChassisTypeName': str,
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': str,
    'Census_InternalPrimaryDisplayResolutionHorizontal': str,
    'Census_InternalPrimaryDisplayResolutionVertical': str,
    'Census_PowerPlatformRoleName': str,
    'Census_InternalBatteryType': str,
    'Census_InternalBatteryNumberOfCharges': str,
    'Census_OSVersion': str,
    'Census_OSArchitecture': str,
    'Census_OSBranch': str,
    'Census_OSBuildNumber': str,
    'Census_OSBuildRevision': str,
    'Census_OSEdition': str,
    'Census_OSSkuName': str,
    'Census_OSInstallTypeName': str,
    'Census_OSInstallLanguageIdentifier': str,
    'Census_OSUILocaleIdentifier': str,
    'Census_OSWUAutoUpdateOptionsName': str,
    'Census_IsPortableOperatingSystem': np.int8,
    'Census_GenuineStateName': str,
    'Census_ActivationChannel': str,
    'Census_IsFlightingInternal': str,
    'Census_IsFlightsDisabled': str,
    'Census_FlightRing': str,
    'Census_ThresholdOptIn': str,
    'Census_FirmwareManufacturerIdentifier': str,
    'Census_FirmwareVersionIdentifier': str,
    'Census_IsSecureBootEnabled': np.int8,
    'Census_IsWIMBootEnabled': str,
    'Census_IsVirtualDevice': str,
    'Census_IsTouchEnabled': np.int8,
    'Census_IsPenCapable': np.int8,
    'Census_IsAlwaysOnAlwaysConnectedCapable': str,
    'Wdft_IsGamer': str,
    'Wdft_RegionIdentifier': str,
    'HasDetections': np.int8
}

full_features = pd.read_csv("train.csv", dtype=datatypes, index_col="MachineIdentifier")
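
The memory argument for explicit dtypes can be checked on a small scale. The following is a minimal sketch, not part of the original notebook: it builds a made-up in-memory CSV (the rows and the reduced column set are assumptions for illustration) and compares the column memory of pandas' inferred dtypes with narrow explicit ones. It also uses the `category` dtype, which the notebook itself does not use, as one possible alternative to `str` for low-cardinality columns.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for train.csv (values are made up for illustration).
csv_text = "MachineIdentifier,Platform,IsBeta,HasDetections\n" + "".join(
    f"m{i},windows10,0,{i % 2}\n" for i in range(1000)
)

# Default inference: integers become int64, strings are stored per row.
inferred = pd.read_csv(io.StringIO(csv_text), index_col="MachineIdentifier")

# Explicit narrow dtypes; 'category' stores each distinct string only once.
compact_dtypes = {"Platform": "category", "IsBeta": np.int8, "HasDetections": np.int8}
compact = pd.read_csv(
    io.StringIO(csv_text), dtype=compact_dtypes, index_col="MachineIdentifier"
)

# Compare column memory only (index excluded), counting string payloads too.
inferred_bytes = inferred.memory_usage(index=False, deep=True).sum()
compact_bytes = compact.memory_usage(index=False, deep=True).sum()
print(inferred_bytes, compact_bytes)
```

On the full 8.9-million-row file the same idea applies column by column, although the actual savings depend on each column's cardinality.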

In [4]:
print (full_features.shape)

(8921483, 82)


In [5]:
# Optional
# To speed things up, we shuffle the data and keep only 200,000 rows; otherwise processing takes quite a while

# Shuffle the data

shuffle = np.random.permutation(np.arange(full_features.shape[0]))[:200000]
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]

print (full_features.shape)

(200000, 82)
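
The subsample above changes on every run because no random seed is set. As a hedged alternative, `DataFrame.sample` draws rows without replacement and accepts a `random_state`, which makes the subset repeatable. The toy frame and the seed value 42 below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for full_features.
toy = pd.DataFrame(
    {"HasDetections": np.arange(1000) % 2},
    index=[f"m{i}" for i in range(1000)],
)

# sample() draws rows without replacement; random_state makes the draw repeatable.
subset_a = toy.sample(n=200, random_state=42)
subset_b = toy.sample(n=200, random_state=42)

print(subset_a.shape)
print(subset_a.index.equals(subset_b.index))
```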


In [6]:
# Checking the columns with the most NULL values
print((full_features.isnull().sum()).sort_values(ascending=False).head(20))

PuaMode                                  199941
Census_ProcessorClass                    199184
DefaultBrowsersIdentifier                190372
Census_IsFlightingInternal               165913
Census_InternalBatteryType               142006
Census_ThresholdOptIn                    127046
Census_IsWIMBootEnabled                  126882
SmartScreen                               71278
OrganizationIdentifier                    61155
SMode                                     12106
CityIdentifier                             7095
Wdft_IsGamer                               6746
Wdft_RegionIdentifier                      6746
Census_InternalBatteryNumberOfCharges      6116
Census_FirmwareManufacturerIdentifier      4152
Census_IsFlightsDisabled                   3657
Census_FirmwareVersionIdentifier           3609
Census_OEMModelIdentifier                  2252
Census_OEMNameIdentifier                   2108
Firewall                                   2027
dtype: int64


In [7]:
full_features['PuaMode'].unique()

array([nan, 'on'], dtype=object)

In [8]:
full_features['Census_IsFlightingInternal'].unique()

array([nan, '0'], dtype=object)

In [9]:
full_features['Census_InternalBatteryType'].unique()

array([nan, 'lion', 'lip', 'li-i', '#', 'pad0', 'liio', 'li', 'bq20',
       'pbac', 'nimh', 'real', 'li p', 'lgi0', 'ots0', 'unkn', 'vbox',
       '4cel', 'lgs0', 'ithi', 'lipo', 'ÿÿÿÿ', 'lhp0', 'virt', 'lipp'],
      dtype=object)

In [10]:
full_features['Census_ThresholdOptIn'].unique()

array([nan, '0', '1'], dtype=object)

In [11]:
full_features['Census_IsWIMBootEnabled'].unique()

array([nan, '0'], dtype=object)

In [12]:
full_features['SMode'].unique()

array(['0', nan, '1'], dtype=object)

In [13]:
full_features['OrganizationIdentifier'].unique()

array(['48', '27', nan, '18', '50', '37', '14', '49', '46', '32', '33',
       '4', '11', '36', '47', '8', '52', '20', '44', '2', '28', '51',
       '40', '22', '1', '45', '10', '5', '39', '21', '31', '16', '26',
       '3', '41', '30', '6', '19', '7', '29', '23', '42', '35'],
      dtype=object)

In [14]:
full_features['Wdft_IsGamer'].unique()

array(['0', '1', nan], dtype=object)

In [15]:
full_features['Wdft_RegionIdentifier'].unique()

array(['10', '15', '3', '13', '1', '11', '7', '4', '9', '5', '2', nan,
       '8', '6', '12', '14'], dtype=object)

In [16]:
full_features['CityIdentifier'].unique()

array(['107470', '92208', '165694', ..., '149651', '68816', '77901'],
      dtype=object)

In [17]:
full_features['Census_InternalBatteryNumberOfCharges'].unique()

array(['126', '0', nan, ..., '35757', '11992', '23417'], dtype=object)

In [18]:
# Cleaning up some data

# PuaMode - Potentially Unwanted Applications, if NA, then it is disabled. 99% are NA. So, better to drop it
# Census_ProcessorClass - According to the description - "No longer maintained and updated"
# DefaultBrowsersIdentifier - Almost all values are empty. Therefore we will drop this column
# Census_IsFlightingInternal - whether this is an internal or "external" testing ring. Column mostly unused. Will have to drop it
# Census_InternalBatteryType - contains mostly garbage. Besides, it should not be relevant to the attack surface.
# Census_ThresholdOptIn - also mostly unused. Googled it: Threshold was used in the first versions of Windows 10. Looks unused now
# Census_IsWIMBootEnabled - Is it possible to boot from a Windows Image? Not relevant to identifying attacks when about 63% of the data is empty
# SmartScreen - Whether smart screen in explorer is enabled. Should be important. "ExistsNotSet" when null, according to the description
# SMode - Quite relevant field. Will be keeping it
# OrganizationIdentifier - Attacks by organizations should be analyzed. If not filled, will assign "0". 
# Census_InternalBatteryNumberOfCharges - Not relevant. Will drop this column in order not to overtrain
# Census_OSSkuName -  OS edition friendly name (currently Windows only). - Can be removed. Duplicate field
# Census_ChassisTypeName - Census_MDC2FormFactor gives better information. Let's remove this field

#full_features['PuaMode'] = full_features['PuaMode'].fillna('off')
#full_features['SmartScreen'] = full_features['SmartScreen'].fillna('ExistsNotSet')
#full_features['SMode'] = full_features['SMode'].fillna('0').astype('int8')
#full_features['OrganizationIdentifier'] = full_features['OrganizationIdentifier'].fillna('0').astype('int32')
#full_features['Wdft_IsGamer'] = full_features['Wdft_IsGamer'].fillna('0').astype('int8')
#full_features['Wdft_RegionIdentifier'] = full_features['Wdft_RegionIdentifier'].fillna('0').astype('int32')
#full_features['CityIdentifier'] = full_features['CityIdentifier'].fillna('0').astype('int32')

#full_features = full_features.drop([
#    'PuaMode',
#    'Census_OSEdition',
#    'Census_ProcessorClass',
#    'DefaultBrowsersIdentifier',
#    'Census_IsFlightingInternal',
#    'Census_InternalBatteryType'], axis=1)
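
The cleanup rules in the comments above can be sketched on a toy frame. This is an illustrative sketch rather than the notebook's actual cleaning step: the 70% cutoff and the toy values are assumptions, and only the fill defaults (`ExistsNotSet` for SmartScreen, `0` for SMode) come from the commented-out code.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SmartScreen": ["RequireAdmin", np.nan, "Off", np.nan],
    "SMode": ["0", np.nan, "1", "0"],
})

# Fraction of missing values per column.
null_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns that are more than 70% empty.
too_sparse = null_fraction[null_fraction > 0.7].index.tolist()
cleaned = toy.drop(columns=too_sparse)

# Fill the remaining gaps with the defaults named in the commented-out code.
cleaned["SmartScreen"] = cleaned["SmartScreen"].fillna("ExistsNotSet")
cleaned["SMode"] = cleaned["SMode"].fillna("0").astype("int8")
print(cleaned)
```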

In [19]:
# Now let us check the string columns

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['ProductName',
 'EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'DefaultBrowsersIdentifier',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'CityIdentifier',
 'OrganizationIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'Platform',
 'Processor',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'OsBuildLab',
 'SkuEdition',
 'IsProtected',
 'PuaMode',
 'SMode',
 'IeVerIdentifier',
 'SmartScreen',
 'Firewall',
 'UacLuaenable',
 'Census_MDC2FormFactor',
 'Census_DeviceFamily',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_ProcessorClass',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'C

In [20]:
full_features[string_columns].head(10)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_ProcessorClass,Census_PrimaryDiskTypeName,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
fecbcefeb3c7a3a046263810525a81b5,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1236.0,7,,7945,2,1,29,107470.0,48.0,35,171,windows10,x64,10.0.0.0,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1,,0,135,ExistsNotSet,1,1,Notebook,Windows.Desktop,4142,134956,4,5,2373,,HDD,Laptop,13.9,1366,768,Mobile,,126.0,10.0.16299.125,amd64,rs3_release,16299,125,CoreSingleLanguage,CORE_SINGLELANGUAGE,Upgrade,26,119,FullAuto,IS_GENUINE,Retail,,0,Retail,,142,51777,,0,0,0,10
d8df3d0556fa4a826fce204a9432e6dc,win8defender,1.1.15100.1,4.10.209.0,1.273.1652.0,7,,43747,2,2,141,92208.0,27.0,167,227,windows8,x64,6.3.0.0,9600,768,windows8.1,9600.19101.amd64fre.winblue_ltsb_escrow.180718...,Home,1,,0,333,ExistsNotSet,1,1,Notebook,Windows.Desktop,2206,242491,4,1,289,,HDD,Notebook,15.5,1366,768,Mobile,lion,0.0,10.0.10586.494,amd64,th2_release,10586,494,Core,CORE,Update,8,31,FullAuto,IS_GENUINE,Retail,,0,Retail,0.0,554,33000,0.0,0,0,0,10
24d97ff81e15ad492d842119044c0871,win8defender,1.1.15100.1,4.18.1807.18075,1.273.642.0,7,,53447,1,1,60,165694.0,27.0,274,182,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1,,0,108,RequireAdmin,1,1,Desktop,Windows.Desktop,585,190133,4,5,3327,,UNKNOWN,Desktop,16.3,1366,768,Desktop,,,10.0.15063.1206,amd64,rs2_release,15063,1206,Core,CORE,Update,9,34,AutoInstallAndRebootAtMaintenanceTime,IS_GENUINE,Retail,,0,Retail,,93,51050,,0,0,1,15
71ce8a1e7833c1b026bf78c7cd2f181d,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1001.0,7,,22728,2,1,120,120697.0,,144,140,windows10,x64,10.0.0.0,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1,,0,117,RequireAdmin,1,1,Convertible,Windows.Desktop,2668,170943,4,5,2241,,SSD,Notebook,13.2,3200,1800,Mobile,lip,0.0,10.0.16299.371,amd64,rs3_release,16299,371,Core,CORE,Update,8,31,FullAuto,IS_GENUINE,Retail,0.0,0,Retail,0.0,628,21399,0.0,0,0,0,3
f5299e96739a95dc543df78f43d284b6,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1682.0,7,,53447,1,1,207,49499.0,27.0,277,75,windows10,x64,10.0.0.0,15063,256,rs2,15063.0.amd64fre.rs2_release.170317-1834,Pro,1,,0,108,ExistsNotSet,1,1,Desktop,Windows.Desktop,1980,333856,8,5,2951,,SSD,Desktop,27.0,1920,1080,Desktop,,4294967295.0,10.0.15063.1206,amd64,rs2_release,15063,1206,Professional,PROFESSIONAL,Update,8,31,AutoInstallAndRebootAtMaintenanceTime,IS_GENUINE,Retail,0.0,0,Retail,0.0,142,35595,0.0,0,0,1,13
7a025c72d473fe85a9799cf9b12a52e7,win8defender,1.1.15100.1,4.18.1807.18075,1.273.810.0,7,,47238,2,1,158,36285.0,48.0,202,70,windows10,x64,10.0.0.0,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1,,0,117,RequireAdmin,1,1,Notebook,Windows.Desktop,2668,171221,4,5,2373,,HDD,Notebook,13.9,1366,768,Mobile,li-i,0.0,10.0.16299.371,amd64,rs3_release,16299,371,CoreSingleLanguage,CORE_SINGLELANGUAGE,Update,8,31,Notify,IS_GENUINE,OEM:DM,,0,Retail,0.0,628,21675,0.0,0,0,0,1
b5062d8c7f2fb750df1be60e22ff2e04,win8defender,1.1.14600.4,4.13.17134.228,1.263.48.0,7,,3371,2,1,120,120697.0,,144,139,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1,,0,137,RequireAdmin,1,1,Notebook,Windows.Desktop,4730,302480,4,5,2574,,HDD,Notebook,15.5,1366,768,Mobile,lion,188.0,10.0.17134.228,amd64,rs4_release,17134,228,Core,CORE,UUPUpgrade,7,30,FullAuto,IS_GENUINE,OEM:NONSLP,0.0,0,Retail,0.0,513,9011,0.0,0,0,0,3
2439cf6f16a54e2f1f30e63ef5366733,win8defender,1.1.15100.1,4.18.1806.18062,1.273.483.0,0,,57629,2,1,214,146748.0,27.0,287,75,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.1805...,Pro,1,,0,117,,1,1,Notebook,Windows.Desktop,585,318034,4,5,2660,,HDD,Notebook,13.9,1920,1080,Mobile,,0.0,10.0.16299.492,amd64,rs3_release_svc_escrow,16299,492,ProfessionalEducation,PROFESSIONAL,Reset,8,31,UNKNOWN,IS_GENUINE,Retail,,0,Retail,,556,63317,,0,0,0,1
d4e5f2acb25087b70b8bcfa1b1ea2b9b,win8defender,1.1.15200.1,4.18.1807.18075,1.275.1327.0,7,,49480,2,1,95,,18.0,121,168,windows10,x64,10.0.0.0,14393,768,rs1,14393.187.amd64fre.rs1_release_inmarket.160906...,Home,1,,0,94,RequireAdmin,1,1,Convertible,Windows.Desktop,585,313257,4,5,3499,,HDD,Convertible,11.6,1920,1080,Mobile,,0.0,10.0.14393.187,amd64,rs1_release,14393,187,CoreSingleLanguage,CORE_SINGLELANGUAGE,Other,8,42,UNKNOWN,IS_GENUINE,OEM:DM,,0,Retail,,556,63041,,0,0,0,11
7e3e226dd18552a7847ba6491ccb1fbb,win8defender,1.1.15200.1,4.18.1807.18075,1.275.843.0,7,,53447,1,1,61,923.0,,277,75,windows10,x64,10.0.0.0,16299,768,rs3,16299.15.amd64fre.rs3_release.170928-1534,Home,1,,0,111,,1,1,Notebook,Windows.Desktop,2102,241876,4,5,2412,,HDD,Notebook,15.5,1366,768,Mobile,,0.0,10.0.16299.192,amd64,rs3_release,16299,192,Core,CORE,IBSClean,8,31,UNKNOWN,IS_GENUINE,OEM:NONSLP,,0,Retail,,554,33135,,0,0,0,11


At first glance, it is obvious that the string columns hold either categorical labels or version numbers made up of four numeric parts. So, in order to use algorithms that support only numeric values, we will convert categorical columns like "ProductName" to an integer range and split fields like AppVersion into their numeric parts.
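
One way to turn a four-part version string such as AppVersion into numeric columns is sketched below. This is a minimal sketch, assuming every value has exactly four dot-separated integer parts; the output column names (`AppVersion_1` to `AppVersion_4`) are hypothetical, and the sample values are taken from the rows shown above.

```python
import pandas as pd

toy = pd.DataFrame(
    {"AppVersion": ["4.18.1807.18075", "4.10.209.0", "4.13.17134.228"]}
)

# Split "a.b.c.d" into four columns and convert each part to an integer.
parts = toy["AppVersion"].str.split(".", expand=True).astype("int32")
parts.columns = [f"AppVersion_{i}" for i in range(1, 5)]

# Replace the original string column with its numeric parts.
toy = toy.drop(columns="AppVersion").join(parts)
print(toy)
```

Rows with missing or malformed versions would need a fill or filter step before the `astype` call.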

In [21]:
def df_replacevalues(df, colname, oldvalues, newvalues, topvalue):
    """Map the string categories in df[colname] to integers, in place.

    oldvalues/newvalues are parallel lists; topvalue replaces NaN and unlisted values.
    """
    # The fallback value is passed in explicitly rather than computed as the mode:
    #topvalue = df[colname].value_counts().idxmax()

    # Replace NaN values with the fallback value
    # (assign back instead of inplace=True on a column selection, which may act on a copy)
    df[colname] = df[colname].fillna(topvalue)

    # Find the rows whose value is not listed in oldvalues
    garbage = ~df[colname].isin(oldvalues)

    # If the "garbage" values are more than 1% of the rows, raise an error
    if garbage.sum() > len(df) / 100:
        raise ValueError("Not all necessary values are present in oldvalues")

    # Replace "garbage" with the fallback value
    df.loc[garbage, colname] = topvalue

    print ("Previous values", df[colname].unique())
    df[colname] = pd.to_numeric(df[colname].replace(oldvalues, newvalues), errors='raise', downcast='integer')
    print ("New values", df[colname].unique())
    
#full_features["Platform"].unique()
#full_features["Platform"].value_counts()
#full_features[~full_features["ProductName"].isin(['win8defender', 'mse'])].index

With the standard converter, accuracy was only 57% while the feature values were still strings.

Replacing the string values in the following features with numbers and using OrdinalEncoder increased the accuracy by 5-10%.

- ProductName
- Platform
- Processor
- OsPlatformSubRelease
- SkuEdition
- SmartScreen
- Census_MDC2FormFactor
- Census_DeviceFamily
- Census_PrimaryDiskTypeName
- Census_ChassisTypeName
- Census_PowerPlatformRoleName
- Census_OSArchitecture
- Census_OSBranch
- Census_OSSkuName
- Census_OSInstallTypeName
- Census_OSWUAutoUpdateOptionsName
- Census_InternalBatteryType
- Census_GenuineStateName
- Census_ActivationChannel
- Census_FlightRing
- Census_OSEdition
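
For reference, a minimal sketch of scikit-learn's `OrdinalEncoder` on one of these features follows. It is not the notebook's actual encoding step: the toy frames and the unseen value `"ia64"` are made up, the codes start at 0 rather than 1, and `handle_unknown="use_encoded_value"` assumes scikit-learn 0.24 or newer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Processor": ["x64", "x86", "arm64", "x64"]})
test = pd.DataFrame({"Processor": ["x86", "ia64"]})  # "ia64" is a made-up unseen value

# Fixing the category order makes the codes explicit: x64 -> 0, arm64 -> 1, x86 -> 2.
encoder = OrdinalEncoder(
    categories=[["x64", "arm64", "x86"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # code assigned to categories not listed above
)
train_codes = encoder.fit_transform(train[["Processor"]])
test_codes = encoder.transform(test[["Processor"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

Passing `categories` explicitly keeps the integer codes stable between train and test; without it the encoder sorts the categories it sees during `fit`.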

In [22]:
print(full_features["ProductName"].value_counts())

colname = "ProductName"
oldvalues = ['win8defender','mse','mseprerelease']
newvalues = [1,2,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'win8defender')

win8defender     197878
mse                2121
mseprerelease         1
Name: ProductName, dtype: int64
Previous values ['win8defender' 'mse' 'mseprerelease']
New values [1 2]


In [23]:
print(full_features["Platform"].value_counts())

colname = "Platform"
oldvalues = ['windows10','windows7','windows8','windows2016','Undefined']
newvalues = [10,7,8,2016,-1]

df_replacevalues(full_features, colname, oldvalues, newvalues,'Undefined')

windows10      193156
windows8         4435
windows7         2103
windows2016       306
Name: Platform, dtype: int64
Previous values ['windows10' 'windows8' 'windows7' 'windows2016']
New values [  10    8    7 2016]


In [24]:
print(full_features["Processor"].value_counts())

colname = "Processor"
oldvalues = ['x64','arm64','x86']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'x64')

x64      181813
x86       18182
arm64         5
Name: Processor, dtype: int64
Previous values ['x64' 'x86' 'arm64']
New values [1 3 2]


In [25]:
colname = "OsPlatformSubRelease"

print(full_features[colname].value_counts())

oldvalues = ['rs4','rs3','rs2','rs1','windows7','windows8.1','th1','th2','prers5','Unknown']
newvalues = [504,503,502,501,        407,408,                201,202,     505,     0]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Unknown')

rs4           87849
rs3           55907
rs2           17696
rs1           16391
th2            9179
th1            6004
windows8.1     4435
windows7       2103
prers5          436
Name: OsPlatformSubRelease, dtype: int64
Previous values ['rs3' 'windows8.1' 'rs2' 'rs4' 'rs1' 'th1' 'th2' 'windows7' 'prers5']
New values [503 408 502 504 501 201 202 407 505]


In [26]:
colname = "SkuEdition"

print(full_features[colname].value_counts())

oldvalues = ['Pro','Home','Invalid','Enterprise LTSB','Enterprise','Education','Cloud','Server']
newvalues = [55,52,0,71,70,20,90,80]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Invalid')

Home               123809
Pro                 72113
Invalid              1764
Education             928
Enterprise            767
Enterprise LTSB       429
Cloud                 110
Server                 80
Name: SkuEdition, dtype: int64
Previous values ['Home' 'Pro' 'Enterprise LTSB' 'Invalid' 'Education' 'Enterprise'
 'Server' 'Cloud']
New values [52 55 71  0 20 70 80 90]


In [27]:
colname = "SmartScreen"

print(full_features[colname].value_counts())

oldvalues = ['Off','off','OFF','On','on','Warn','Prompt','ExistsNotSet','Block','RequireAdmin']
newvalues = [0,0,0,1,1,2,3,4,5,6]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'ExistsNotSet')

RequireAdmin    96763
ExistsNotSet    23472
Off              4182
Warn             2991
Prompt            772
Block             484
off                28
On                 11
&#x02;              7
on                  6
&#x01;              5
0                   1
Name: SmartScreen, dtype: int64
Previous values ['ExistsNotSet' 'RequireAdmin' 'Block' 'Off' 'Warn' 'Prompt' 'off' 'On'
 'on']
New values [4 6 5 0 2 3 1]
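
The SmartScreen mapping lists `Off`, `off` and `OFF` separately. A possible alternative, sketched here under the assumption that case differences carry no meaning, is to lowercase the values first and map them with a dictionary. The codes are the ones used in the cell above; the toy series is made up.

```python
import pandas as pd

raw = pd.Series(["Off", "off", "OFF", "On", "on", "Warn", None, "\x01"])

# One lowercase key per setting; codes follow the cell above.
codes = {"off": 0, "on": 1, "warn": 2, "prompt": 3,
         "existsnotset": 4, "block": 5, "requireadmin": 6}

normalized = raw.str.strip().str.lower()

# Missing and unrecognized values fall back to ExistsNotSet (4), as in the cell above.
encoded = normalized.map(codes).fillna(codes["existsnotset"]).astype("int8")
print(encoded.tolist())
```

Unlike `df_replacevalues`, this sketch does not raise when more than 1% of the rows are unrecognized, so that check would have to be added separately.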


In [28]:
colname = "Census_MDC2FormFactor"

print(full_features[colname].value_counts())

oldvalues = ['Desktop','Notebook','Detachable','PCOther','AllInOne','Convertible','SmallTablet','LargeTablet','SmallServer','LargeServer','MediumServer','ServerOther','Other']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

Notebook        128491
Desktop          43451
Convertible       9023
Detachable        6801
AllInOne          6594
PCOther           3165
LargeTablet       1531
SmallTablet        676
SmallServer        183
MediumServer        64
LargeServer         21
Name: Census_MDC2FormFactor, dtype: int64
Previous values ['Notebook' 'Desktop' 'Convertible' 'AllInOne' 'Detachable' 'PCOther'
 'LargeTablet' 'SmallTablet' 'MediumServer' 'SmallServer' 'LargeServer']
New values [ 2  1  6  5  3  4  8  7 11  9 10]


In [29]:
# Census_DeviceFamily ['Windows.Desktop' 'Windows.Server' 'Windows']

colname = "Census_DeviceFamily"

print(full_features[colname].value_counts())

oldvalues = ['Windows.Desktop','Windows.Server','Windows']
#newvalues = [i+1 for i in range(len(oldvalues))]
# Windows = Windows.Desktop
newvalues = [1,2,1]
    
df_replacevalues(full_features, colname, oldvalues, newvalues, 'Windows.Desktop')

Windows.Desktop    199694
Windows.Server        306
Name: Census_DeviceFamily, dtype: int64
Previous values ['Windows.Desktop' 'Windows.Server']
New values [1 2]


In [30]:
# Census_PrimaryDiskTypeName ['HDD' 'SSD' 'UNKNOWN' 'Unspecified' nan]

colname = "Census_PrimaryDiskTypeName"

print(full_features[colname].value_counts())

oldvalues = ['HDD','SSD','UNKNOWN','Unspecified']
newvalues = [1,2,3,3]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Unspecified')

HDD            129929
SSD             55416
UNKNOWN          8063
Unspecified      6289
Name: Census_PrimaryDiskTypeName, dtype: int64
Previous values ['HDD' 'UNKNOWN' 'SSD' 'Unspecified']
New values [1 3 2]


In [31]:
# Census_ChassisTypeName Index(['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 'RackMountChassis', 'SubNotebook', 'BusExpansionChassis', '30', 'StickPC', '0', 'MultisystemChassis', 'Blade', '35', 'PizzaBox', 'SealedCasePC', 'SubChassis', 'ExpansionChassis', '31', '32', '88', '127', '25', '44', '36', 'DockingStation', 'BladeEnclosure', 'CompactPCI', '81', '45', 'EmbeddedPC', '28', '82', '112', 'IoTGateway', '49', '76', '39'], dtype='object')

colname = "Census_ChassisTypeName"

print(full_features[colname].value_counts())

oldvalues = ['Notebook', 'Desktop', 'Laptop', 'Portable', 'AllinOne', 'MiniTower', 'Convertible', 'Other', 'UNKNOWN', 'Detachable', 
             'LowProfileDesktop', 'HandHeld', 'SpaceSaving', 'Tablet', 'Tower', 'Unknown', 'MainServerChassis', 'MiniPC', 'LunchBox', 
             'RackMountChassis', 'SubNotebook', 'BusExpansionChassis']
# Grouping Laptop/Notebook, unknown and other
newvalues = [1,2,1,3,4,5,6,0,0,7,
             8,9,10,11,12,0,13,14,15,
             16,1,17]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

Notebook               118044
Desktop                 41810
Laptop                  15344
Portable                 7910
AllinOne                 4530
MiniTower                1915
Convertible              1900
Other                    1679
UNKNOWN                  1549
Detachable               1156
LowProfileDesktop        1134
HandHeld                 1027
SpaceSaving               635
Tablet                    338
Tower                     292
MainServerChassis         222
Unknown                   204
MiniPC                     90
LunchBox                   89
RackMountChassis           76
SubNotebook                15
BusExpansionChassis        14
StickPC                     5
30                          4
SealedCasePC                2
0                           2
35                          2
MultisystemChassis          1
SubChassis                  1
Name: Census_ChassisTypeName, dtype: int64
Previous values ['Laptop' 'Notebook' 'Desktop' 'Convertible' 'AllinOne' 'Detachable'
 '

In [32]:
# Census_PowerPlatformRoleName Index(['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']

colname = "Census_PowerPlatformRoleName"

print(full_features[colname].value_counts())

oldvalues = ['Mobile', 'Desktop', 'Slate', 'Workstation', 'SOHOServer', 'UNKNOWN', 'EnterpriseServer', 'AppliancePC', 'PerformanceServer', 'Unspecified']
newvalues = [1,2,3,4,5,0,6,7,8,0]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

Mobile               138716
Desktop               46077
Slate                 11177
Workstation            2426
SOHOServer              844
UNKNOWN                 489
EnterpriseServer        156
AppliancePC             111
PerformanceServer         3
Name: Census_PowerPlatformRoleName, dtype: int64
Previous values ['Mobile' 'Desktop' 'Workstation' 'Slate' 'SOHOServer' 'UNKNOWN'
 'AppliancePC' 'EnterpriseServer' 'PerformanceServer']
New values [1 2 4 3 5 0 7 6 8]


In [33]:
# Census_OSArchitecture Index(['amd64', 'x86', 'arm64'], dtype='object')

colname = "Census_OSArchitecture"

print(full_features[colname].value_counts())

oldvalues = ['amd64', 'x86', 'arm64']
newvalues = [1,3,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'amd64')

amd64    181804
x86       18191
arm64         5
Name: Census_OSArchitecture, dtype: int64
Previous values ['amd64' 'x86' 'arm64']
New values [1 3 2]


In [34]:
# Census_OSBranch Index(['rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 'rs2_release', 'rs1_release', 'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'rs5_release', 'rs3_release_svc_escrow_im', 'rs_prerelease', 'rs_prerelease_flt', 'rs5_release_sigma', 'rs1_release_srvmedia', 'winblue_ltsb_escrow', 'win7sp1_ldr', 'winblue_ltsb', 'win8_gdr', 'rs_xbox', 'rs5_release_edge', 'rs5_release_sigma_dev', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'rs_shell', 'rs1_release_svc', 'win8_ldr', 'rs_onecore_base_cobalt', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs3_release_svc', 'Khmer OS'], dtype='object')

colname = "Census_OSBranch"

print(full_features[colname].value_counts())

oldvalues = ['rs5_release', 'rs5_release_sigma', 'rs4_release', 'rs3_release', 'rs3_release_svc_escrow', 
             'rs3_release_svc_escrow_im', 'rs2_release', 'rs1_release', 'rs_prerelease', 'rs_prerelease_flt', 
             'th2_release', 'th2_release_sec', 'th1_st1', 'th1', 'Undefined']
newvalues = [25,25,24,23,23,23,22,21,20,20,
             12,12,11,11,0]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Undefined')

rs4_release                  89994
rs3_release                  27613
rs3_release_svc_escrow       26714
rs2_release                  18071
rs1_release                  17638
th2_release                   7369
th2_release_sec               5957
th1_st1                       4334
th1                           1715
rs5_release                    306
rs3_release_svc_escrow_im      150
rs_prerelease                   80
rs_prerelease_flt               56
rs5_release_sigma                2
rs_shell                         1
Name: Census_OSBranch, dtype: int64
Previous values ['rs3_release' 'th2_release' 'rs2_release' 'rs4_release'
 'rs3_release_svc_escrow' 'rs1_release' 'th1' 'th2_release_sec' 'th1_st1'
 'rs_prerelease' 'rs3_release_svc_escrow_im' 'rs5_release'
 'rs_prerelease_flt' 'rs5_release_sigma' 'Undefined']
New values [23 12 22 24 21 11 20 25  0]
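
The codes above follow the release family encoded in the branch prefix (rs5, rs4, ..., th1). As a hedged alternative to listing every branch name, the rank could be derived from the prefix. The helper `branch_rank` and the toy series are illustrative assumptions. Unlike the explicit list, prefix matching also ranks variants that are not listed (for example `rs3_release_svc`), so the two approaches do not give identical results.

```python
import pandas as pd

# Prefix -> rank pairs, taken from the explicit mapping in the cell above
PREFIX_RANK = [
    ("rs5", 25), ("rs4", 24), ("rs3", 23), ("rs2", 22), ("rs1", 21),
    ("rs_prerelease", 20), ("th2", 12), ("th1", 11),
]

def branch_rank(branch, default=0):
    """Return the rank of the first matching prefix, or `default` if none matches."""
    if not isinstance(branch, str):
        return default
    for prefix, rank in PREFIX_RANK:
        if branch.startswith(prefix):
            return rank
    return default

# Toy data using branch names that appear in the output above
branches = pd.Series(
    ["rs4_release", "rs3_release_svc_escrow", "th1_st1", "rs_shell", "rs_prerelease_flt"]
)
ranks = branches.map(branch_rank)
print(ranks.tolist())
```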


In [35]:
# Census_OSSkuName Index(['CORE', 'PROFESSIONAL', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'EDUCATION', 'ENTERPRISE', 'PROFESSIONAL_N', 'ENTERPRISE_S', 'STANDARD_SERVER', 'CLOUD', 'CORE_N', 'STANDARD_EVALUATION_SERVER', 'EDUCATION_N', 'ENTERPRISE_S_N', 'DATACENTER_EVALUATION_SERVER', 'SB_SOLUTION_SERVER', 'ENTERPRISE_N', 'PRO_WORKSTATION', 'UNLICENSED', 'DATACENTER_SERVER', 'PRO_WORKSTATION_N', 'CLOUDN', 'PRO_CHINA', 'SERVERRDSH', 'ULTIMATE', 'PRO_FOR_EDUCATION', 'PRO_SINGLE_LANGUAGE', 'UNDEFINED', 'STARTER', 'ENTERPRISEG'], dtype='object')

colname = "Census_OSSkuName"
oldvalues = ['CORE', 'CORE_SINGLELANGUAGE', 'CORE_COUNTRYSPECIFIC', 'CORE_N',
             'EDUCATION', 'EDUCATION_N',
             'PROFESSIONAL', 'PROFESSIONAL_N', 'PRO_WORKSTATION',
             'ENTERPRISE',  'ENTERPRISE_S', 'ENTERPRISE_S_N', 'ENTERPRISE_N', 
             'CLOUD',
             'SB_SOLUTION_SERVER', 'STANDARD_SERVER', 'STANDARD_EVALUATION_SERVER', 'DATACENTER_EVALUATION_SERVER', 'UNLICENSED']
newvalues = [i+1 for i in range(len(oldvalues))]

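# Group this feature by values
# Note: `'CORE' in full_features['Census_OSSkuName']` tests the Series index (MachineIdentifier),
# not the column values, so each flag below is the same scalar (0 here) for every row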

full_features['CORE'] = 1 if 'CORE' in full_features['Census_OSSkuName'] else 0
full_features['EDUCATION'] = 1 if 'EDUCATION' in full_features['Census_OSSkuName'] else 0
full_features['PRO'] = 1 if 'PRO' in full_features['Census_OSSkuName'] else 0
full_features['ENTERPRISE'] = 1 if 'ENTERPRISE' in full_features['Census_OSSkuName'] else 0
full_features['CLOUD'] = 1 if 'CLOUD' in full_features['Census_OSSkuName'] else 0
full_features['SERVER'] = 1 if 'SERVER' in full_features['Census_OSSkuName'] else 0
full_features['EVALUATION'] = 1 if 'EVALUATION' in full_features['Census_OSSkuName'] else 0

full_features.drop([colname], axis=1, inplace=True)
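
A per-row version of these group flags can be derived from the column values with `Series.str.contains`, since the `in` operator on a Series looks at index labels rather than values. The sketch below uses a toy frame instead of `full_features`. Matching by plain substring is an assumption about the intended grouping (for example, `PRO` also matches `PROFESSIONAL_N`).

```python
import pandas as pd

# Toy data, not taken from the competition files
df = pd.DataFrame(
    {"Census_OSSkuName": ["CORE", "PROFESSIONAL_N", "STANDARD_EVALUATION_SERVER", None]}
)

groups = ["CORE", "EDUCATION", "PRO", "ENTERPRISE", "CLOUD", "SERVER", "EVALUATION"]
for group in groups:
    # regex=False: plain substring match; na=False: a missing SKU name gives flag 0
    df[group] = (
        df["Census_OSSkuName"].str.contains(group, regex=False, na=False).astype("int8")
    )

print(df[groups])
```

One SKU name can set several flags at once: `STANDARD_EVALUATION_SERVER` sets both `SERVER` and `EVALUATION`.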


In [36]:
# Census_OSInstallTypeName Index(['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh'], dtype='object')

colname = "Census_OSInstallTypeName"

print(full_features[colname].value_counts())

oldvalues = ['UUPUpgrade', 'IBSClean', 'Update', 'Upgrade', 'Other', 'Reset', 'Refresh', 'Clean', 'CleanPCRefresh']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

UUPUpgrade        58533
IBSClean          36975
Update            35793
Upgrade           28058
Other             18894
Reset             14315
Refresh            4624
Clean              1561
CleanPCRefresh     1247
Name: Census_OSInstallTypeName, dtype: int64
Previous values ['Upgrade' 'Update' 'UUPUpgrade' 'Reset' 'Other' 'IBSClean' 'Clean'
 'CleanPCRefresh' 'Refresh']
New values [4 3 1 6 5 2 8 9 7]


In [37]:
# Census_OSWUAutoUpdateOptionsName Index(['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify'], dtype='object')

colname = "Census_OSWUAutoUpdateOptionsName"

print(full_features[colname].value_counts())

oldvalues = ['FullAuto', 'UNKNOWN', 'Notify', 'AutoInstallAndRebootAtMaintenanceTime', 'Off', 'DownloadNotify']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

FullAuto                                 88855
UNKNOWN                                  56286
Notify                                   45489
AutoInstallAndRebootAtMaintenanceTime     8451
Off                                        613
DownloadNotify                             306
Name: Census_OSWUAutoUpdateOptionsName, dtype: int64
Previous values ['FullAuto' 'AutoInstallAndRebootAtMaintenanceTime' 'Notify' 'UNKNOWN'
 'Off' 'DownloadNotify']
New values [1 4 3 2 5 6]


In [38]:
colname = "Census_InternalBatteryType"

print(full_features[colname].value_counts())

oldvalues = ['lion', 'li-i', '#', 'lip', 'unkn']
newvalues = [1,1,1,1,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'unkn')

lion    45483
li-i     5565
#        4109
lip      1455
liio      732
li p      195
li        142
nimh       94
real       58
bq20       47
pbac       46
vbox       28
unkn       12
lgi0        8
lipo        6
4cel        5
ithi        2
pad0        1
virt        1
lipp        1
ots0        1
lhp0        1
lgs0        1
ÿÿÿÿ        1
Name: Census_InternalBatteryType, dtype: int64
Previous values ['unkn' 'lion' 'lip' 'li-i' '#']
New values [2 1]
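
The mapping above keeps the four most frequent battery labels and sends everything else to `unkn`. The same grouping could be driven by a frequency threshold instead of a hand-written list. This is a sketch on toy data; the threshold of 3 is an arbitrary assumption chosen for the example.

```python
import pandas as pd

# Toy data, not taken from the competition files
battery = pd.Series(["lion"] * 5 + ["li-i"] * 3 + ["nimh", "vbox"])

counts = battery.value_counts()
# Labels seen at least `min_count` times are kept as-is
min_count = 3
frequent = counts[counts >= min_count].index

# Everything else collapses into the default label
grouped = battery.where(battery.isin(frequent), "unkn")
print(grouped.value_counts())
```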


In [39]:
# Census_GenuineStateName Index(['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED'], dtype='object')

colname = "Census_GenuineStateName"

print(full_features[colname].value_counts())

oldvalues = ['IS_GENUINE', 'INVALID_LICENSE', 'OFFLINE', 'UNKNOWN', 'TAMPERED']
newvalues = [1,2,3,4,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'UNKNOWN')

IS_GENUINE         176673
INVALID_LICENSE     17887
OFFLINE              5094
UNKNOWN               346
Name: Census_GenuineStateName, dtype: int64
Previous values ['IS_GENUINE' 'INVALID_LICENSE' 'OFFLINE' 'UNKNOWN']
New values [1 2 3 4]


In [40]:
# Census_ActivationChannel Index(['Retail', 'OEM:DM', 'Volume:GVLK', 'OEM:NONSLP', 'Volume:MAK', 'Retail:TB:Eval'], dtype='object')

#Assigning separate values for Retail, OEM and Volume channels
colname = "Census_ActivationChannel"

print(full_features[colname].value_counts())

oldvalues = ['Retail', 'Retail:TB:Eval', 'OEM:DM', 'OEM:NONSLP', 'Volume:GVLK', 'Volume:MAK', 'Other']
#newvalues = [i+1 for i in range(len(oldvalues))]
newvalues = [1,1,2,2,3,3,4]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

Retail            105972
OEM:DM             76441
Volume:GVLK        10084
OEM:NONSLP          7237
Volume:MAK           186
Retail:TB:Eval        80
Name: Census_ActivationChannel, dtype: int64
Previous values ['Retail' 'OEM:DM' 'OEM:NONSLP' 'Volume:GVLK' 'Retail:TB:Eval'
 'Volume:MAK']
New values [1 2 3]
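
Since the channel family is the part of the label before the first `:`, the Retail/OEM/Volume grouping can also be expressed by splitting the string. This is a sketch on toy data; the fallback code 4 mirrors the `'Other'` entry in the mapping above, and the label `SomethingElse` is made up for the example.

```python
import pandas as pd

# Toy data using the labels listed above plus one made-up label
channels = pd.Series(
    ["Retail", "Retail:TB:Eval", "OEM:DM", "OEM:NONSLP", "Volume:GVLK", "Volume:MAK", "SomethingElse"]
)

# Keep only the text before the first ':' (the channel family)
family = channels.str.split(":").str[0]

# Unrecognised families fall back to code 4
family_codes = family.map({"Retail": 1, "OEM": 2, "Volume": 3}).fillna(4).astype("int64")
print(family_codes.tolist())
```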


In [41]:
full_features['Census_FlightRing'].value_counts()

Retail      187550
NOT_SET       6372
Unknown       5319
RP             228
WIS            219
WIF            212
Disabled        99
OSG              1
Name: Census_FlightRing, dtype: int64

In [42]:
# Census_FlightRing Index(['Retail', 'NOT_SET', 'Unknown', 'WIS', 'WIF', 'RP', 'Disabled', 'OSG', 'Canary', 'Invalid', 'CBCanary'], dtype='object')

colname = "Census_FlightRing"

print(full_features[colname].value_counts())

oldvalues = ['Retail', 'NOT_SET', 'Disabled', 'Unknown']
newvalues = [1,2,2,3]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Unknown')

Retail      187550
NOT_SET       6372
Unknown       5319
RP             228
WIS            219
WIF            212
Disabled        99
OSG              1
Name: Census_FlightRing, dtype: int64
Previous values ['Retail' 'Unknown' 'NOT_SET' 'Disabled']
New values [1 3 2]


In [43]:
# Census_OSEdition: group the editions into Core, Professional, Education, Enterprise, Server, Cloud and Other

colname = "Census_OSEdition"

print(full_features[colname].value_counts())

oldvalues = ['Core','CoreSingleLanguage','CoreCountrySpecific','CoreN',
             'Professional','ProfessionalN','ProfessionalEducation','ProfessionalEducationN',
             'Education','EducationN',
             'Enterprise','EnterpriseS','EnterpriseSN','EnterpriseN',
             'ServerStandard','ServerStandardEval','ServerDatacenterEval','ServerSolution',
             'Cloud',
             'Other']
newvalues = [1,1,1,1,
             2,2,2,2,
             3,3,
             4,4,4,4,
             5,5,5,5,
             6,
             7]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'Other')

Core                       78081
Professional               70056
CoreSingleLanguage         43586
CoreCountrySpecific         3665
ProfessionalEducation       1250
Education                    937
Enterprise                   794
ProfessionalN                628
EnterpriseS                  415
ServerStandard               220
Cloud                        124
CoreN                        108
ServerStandardEval            64
EnterpriseSN                  21
EducationN                    17
ServerDatacenterEval          16
EnterpriseN                    6
ServerSolution                 6
ProfessionalEducationN         4
ProfessionalWorkstation        2
Name: Census_OSEdition, dtype: int64
Previous values ['CoreSingleLanguage' 'Core' 'Professional' 'ProfessionalEducation'
 'CoreCountrySpecific' 'EnterpriseSN' 'Education' 'Enterprise'
 'EnterpriseS' 'ServerStandardEval' 'ServerStandard' 'ProfessionalN'
 'Cloud' 'CoreN' 'ServerSolution' 'EducationN' 'ServerDatacenterEval'
 'Other' 'Profess

In [44]:
# PuaMode Index(['off', 'on', 'audit'], dtype='object')

#colname = "PuaMode"

#print(full_features[colname].value_counts())

#oldvalues = ['off', 'on', 'audit']
#newvalues = [0,1,2]

#df_replacevalues(full_features, colname, oldvalues, newvalues)

full_features.drop(['PuaMode','Census_ProcessorClass','DefaultBrowsersIdentifier'], axis=1, inplace=True)

In [45]:
# Now let us check the string columns again

string_columns = []

for colname in full_features.dtypes.keys():
    if full_features[colname].dtypes.name == "object":
        string_columns.append(colname)
        
string_columns

['EngineVersion',
 'AppVersion',
 'AvSigVersion',
 'RtpStateBitfield',
 'AVProductStatesIdentifier',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'CountryIdentifier',
 'CityIdentifier',
 'OrganizationIdentifier',
 'GeoNameIdentifier',
 'LocaleEnglishNameIdentifier',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsBuildLab',
 'IsProtected',
 'SMode',
 'IeVerIdentifier',
 'Firewall',
 'UacLuaenable',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorCoreCount',
 'Census_ProcessorManufacturerIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical',
 'Census_InternalBatteryNumberOfCharges',
 'Census_OSVersion',
 'Census_OSBuildNumber',
 'Census_OSBuildRevision',
 'Census_OSInstallLanguageIdentifier',
 'Census_OSUILocaleIdentifier',
 'Census_IsFlightingInternal',
 'Census_IsFlightsDisabled',
 'Census_ThresholdOptIn',
 'Cens
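
The loop above collects the columns whose dtype is `object`. pandas also offers `DataFrame.select_dtypes` for the same purpose. A minimal sketch on a toy frame follows; the column values are invented, and the string columns are created with an explicit `object` dtype to mirror the notebook.

```python
import pandas as pd

# Toy frame with two object (string) columns and one integer column
toy = pd.DataFrame(
    {
        "EngineVersion": pd.Series(["1.1.15200.1", "1.1.15100.1"], dtype="object"),
        "OsBuild": pd.Series(["16299", "9600"], dtype="object"),
        "IsBeta": [0, 0],
    }
)

# Names of all columns stored with the object dtype
string_columns = toy.select_dtypes(include="object").columns.tolist()
print(string_columns)
```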

In [46]:
full_features[string_columns].head(10)

MachineIdentifier,EngineVersion,AppVersion,AvSigVersion,RtpStateBitfield,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,OsVer,OsBuild,OsSuite,OsBuildLab,IsProtected,SMode,IeVerIdentifier,Firewall,UacLuaenable,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier
fecbcefeb3c7a3a046263810525a81b5,1.1.15200.1,4.18.1807.18075,1.275.1236.0,7,7945,2,1,29,107470.0,48.0,35,171,10.0.0.0,16299,768,16299.15.amd64fre.rs3_release.170928-1534,1,0,135,1,1,4142,134956,4,5,2373,13.9,1366,768,126.0,10.0.16299.125,16299,125,26,119,,0,,142,51777,,0,0,0,10
d8df3d0556fa4a826fce204a9432e6dc,1.1.15100.1,4.10.209.0,1.273.1652.0,7,43747,2,2,141,92208.0,27.0,167,227,6.3.0.0,9600,768,9600.19101.amd64fre.winblue_ltsb_escrow.180718...,1,0,333,1,1,2206,242491,4,1,289,15.5,1366,768,0.0,10.0.10586.494,10586,494,8,31,,0,0.0,554,33000,0.0,0,0,0,10
24d97ff81e15ad492d842119044c0871,1.1.15100.1,4.18.1807.18075,1.273.642.0,7,53447,1,1,60,165694.0,27.0,274,182,10.0.0.0,15063,768,15063.0.amd64fre.rs2_release.170317-1834,1,0,108,1,1,585,190133,4,5,3327,16.3,1366,768,,10.0.15063.1206,15063,1206,9,34,,0,,93,51050,,0,0,1,15
71ce8a1e7833c1b026bf78c7cd2f181d,1.1.15200.1,4.18.1807.18075,1.275.1001.0,7,22728,2,1,120,120697.0,,144,140,10.0.0.0,16299,768,16299.15.amd64fre.rs3_release.170928-1534,1,0,117,1,1,2668,170943,4,5,2241,13.2,3200,1800,0.0,10.0.16299.371,16299,371,8,31,0.0,0,0.0,628,21399,0.0,0,0,0,3
f5299e96739a95dc543df78f43d284b6,1.1.15100.1,4.18.1807.18075,1.273.1682.0,7,53447,1,1,207,49499.0,27.0,277,75,10.0.0.0,15063,256,15063.0.amd64fre.rs2_release.170317-1834,1,0,108,1,1,1980,333856,8,5,2951,27.0,1920,1080,4294967295.0,10.0.15063.1206,15063,1206,8,31,0.0,0,0.0,142,35595,0.0,0,0,1,13
7a025c72d473fe85a9799cf9b12a52e7,1.1.15100.1,4.18.1807.18075,1.273.810.0,7,47238,2,1,158,36285.0,48.0,202,70,10.0.0.0,16299,768,16299.15.amd64fre.rs3_release.170928-1534,1,0,117,1,1,2668,171221,4,5,2373,13.9,1366,768,0.0,10.0.16299.371,16299,371,8,31,,0,0.0,628,21675,0.0,0,0,0,1
b5062d8c7f2fb750df1be60e22ff2e04,1.1.14600.4,4.13.17134.228,1.263.48.0,7,3371,2,1,120,120697.0,,144,139,10.0.0.0,17134,768,17134.1.amd64fre.rs4_release.180410-1804,1,0,137,1,1,4730,302480,4,5,2574,15.5,1366,768,188.0,10.0.17134.228,17134,228,7,30,0.0,0,0.0,513,9011,0.0,0,0,0,3
2439cf6f16a54e2f1f30e63ef5366733,1.1.15100.1,4.18.1806.18062,1.273.483.0,0,57629,2,1,214,146748.0,27.0,287,75,10.0.0.0,16299,256,16299.431.amd64fre.rs3_release_svc_escrow.1805...,1,0,117,1,1,585,318034,4,5,2660,13.9,1920,1080,0.0,10.0.16299.492,16299,492,8,31,,0,,556,63317,,0,0,0,1
d4e5f2acb25087b70b8bcfa1b1ea2b9b,1.1.15200.1,4.18.1807.18075,1.275.1327.0,7,49480,2,1,95,,18.0,121,168,10.0.0.0,14393,768,14393.187.amd64fre.rs1_release_inmarket.160906...,1,0,94,1,1,585,313257,4,5,3499,11.6,1920,1080,0.0,10.0.14393.187,14393,187,8,42,,0,,556,63041,,0,0,0,11
7e3e226dd18552a7847ba6491ccb1fbb,1.1.15200.1,4.18.1807.18075,1.275.843.0,7,53447,1,1,61,923.0,,277,75,10.0.0.0,16299,768,16299.15.amd64fre.rs3_release.170928-1534,1,0,111,1,1,2102,241876,4,5,2412,15.5,1366,768,0.0,10.0.16299.192,16299,192,8,31,,0,,554,33135,,0,0,0,11


In [47]:
# Now we need to process the columns that contain version numbers
# We will split each of them into 4 to 6 separate columns (OsBuildLab yields 6 parts)

versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']
newcolumnnames = []

for colname in versions:
    data = full_features[colname].str.split(r"\.|-",expand=True) # Split if '.' or '-'
    for i in range(data.shape[1]):
        newcolumnname = "%s_%d" % (colname, i+1)
        newcolumnnames.append(newcolumnname)
        full_features[newcolumnname] = data[i]
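
To make the splitting step concrete, here is a small sketch on a toy frame with a single `OsBuildLab` column. The first value is copied from the table above and the missing second value is invented for illustration. It shows that the parts are still strings after the split and that a missing version yields missing parts.

```python
import pandas as pd

# Toy frame: one real-looking build lab string and one missing value
toy = pd.DataFrame({"OsBuildLab": ["16299.15.amd64fre.rs3_release.170928-1534", None]})

# Split on '.' or '-' and spread the parts over separate columns
parts = toy["OsBuildLab"].str.split(r"[.-]", expand=True)
parts.columns = [f"OsBuildLab_{i + 1}" for i in range(parts.shape[1])]

toy = toy.join(parts)
print(toy.drop(columns="OsBuildLab"))
```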

In [48]:
full_features[newcolumnnames].head(10)

MachineIdentifier,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
fecbcefeb3c7a3a046263810525a81b5,1,1,15200,1,4,18,1807,18075,1,275,1236,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,125
d8df3d0556fa4a826fce204a9432e6dc,1,1,15100,1,4,10,209,0,1,273,1652,0,6,3,0,0,9600,19101,amd64fre,winblue_ltsb_escrow,180718,1800,10,0,10586,494
24d97ff81e15ad492d842119044c0871,1,1,15100,1,4,18,1807,18075,1,273,642,0,10,0,0,0,15063,0,amd64fre,rs2_release,170317,1834,10,0,15063,1206
71ce8a1e7833c1b026bf78c7cd2f181d,1,1,15200,1,4,18,1807,18075,1,275,1001,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,371
f5299e96739a95dc543df78f43d284b6,1,1,15100,1,4,18,1807,18075,1,273,1682,0,10,0,0,0,15063,0,amd64fre,rs2_release,170317,1834,10,0,15063,1206
7a025c72d473fe85a9799cf9b12a52e7,1,1,15100,1,4,18,1807,18075,1,273,810,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,371
b5062d8c7f2fb750df1be60e22ff2e04,1,1,14600,4,4,13,17134,228,1,263,48,0,10,0,0,0,17134,1,amd64fre,rs4_release,180410,1804,10,0,17134,228
2439cf6f16a54e2f1f30e63ef5366733,1,1,15100,1,4,18,1806,18062,1,273,483,0,10,0,0,0,16299,431,amd64fre,rs3_release_svc_escrow,180502,1908,10,0,16299,492
d4e5f2acb25087b70b8bcfa1b1ea2b9b,1,1,15200,1,4,18,1807,18075,1,275,1327,0,10,0,0,0,14393,187,amd64fre,rs1_release_inmarket,160906,1818,10,0,14393,187
7e3e226dd18552a7847ba6491ccb1fbb,1,1,15200,1,4,18,1807,18075,1,275,843,0,10,0,0,0,16299,15,amd64fre,rs3_release,170928,1534,10,0,16299,192


In [49]:
#colname = "OsBuildLab_4"
#print (full_features[colname].value_counts())
#print (colname, full_features[colname].value_counts().keys())

In [50]:
# After splitting the columns, the only parts we need to remap are OsBuildLab_3 and OsBuildLab_4
# The other parts contain only digits and are converted to numbers in a later cell

# OsBuildLab_3 Index(['amd64fre', 'x86fre', 'arm64fre'], dtype='object')

colname = "OsBuildLab_3"
oldvalues = ['amd64fre', 'x86fre', 'arm64fre']
newvalues = [1,3,2]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'amd64fre')

Previous values ['amd64fre' 'x86fre' 'arm64fre']
New values [1 3 2]


In [51]:
# OsBuildLab_4 Index(['rs4_release', 'rs3_release_svc_escrow', 'rs3_release', 'rs2_release', 'rs1_release', 'th2_release_sec', 'th1', 'winblue_ltsb_escrow', 'th2_release', 'rs1_release_inmarket', 'winblue_ltsb', 'win7sp1_ldr', 'rs3_release_svc', 'rs1_release_1', 'win7sp1_ldr_escrow', 'rs1_release_sec', 'th1_st1', 'rs5_release', 'rs1_release_inmarket_aim', 'rs3_release_svc_escrow_im', 'th2_release_inmarket', 'rs_prerelease', 'rs_prerelease_flt', 'win7sp1_gdr', 'winblue_gdr', 'th1_escrow', 'win7_gdr', 'winblue_r4', 'rs1_release_inmarket_rim', 'rs1_release_d', 'winblue_r9', 'winblue_r5', 'win7_rtm', 'win7sp1_rtm', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'rs5_release_sigma', 'win7_ldr', 'rs5_release_sigma_dev', 'rs_xbox', 'rs5_release_edge', 'winblue_rtm', 'win7sp1_rc', 'rs3_release_svc_sec', 'rs_onecore_base_cobalt', 'rs6_prerelease', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs5_release_sign', 'rs_shell']

colname = "OsBuildLab_4"
oldvalues = ['rs6_prerelease',
             'rs5_release', 'rs5_release_sigma', 'rs5_release_sigma_dev', 'rs5_release_edge', 'rs5_release_sign',
             'rs4_release', 
             'rs3_release_svc_escrow', 'rs3_release', 'rs3_release_svc', 'rs3_release_svc_escrow_im', 'rs3_release_svc_sec', 
             'rs2_release', 
             'rs1_release', 'rs1_release_inmarket', 'rs1_release_1', 'rs1_release_sec', 'rs1_release_inmarket_aim', 'rs1_release_inmarket_rim', 'rs1_release_d', 
             'rs_prerelease', 'rs_prerelease_flt',
             'th2_release_sec', 'th2_release', 'th2_release_inmarket', 
             'th1', 'th1_st1', 'th1_escrow', 
             'winblue_ltsb_escrow', 'winblue_ltsb', 'winblue_gdr', 'winblue_r4', 'winblue_r7', 'winblue_r3', 'winblue_r8', 'winblue_r9', 'winblue_r5', 'winblue_rtm',
             'win7sp1_ldr', 'win7sp1_ldr_escrow', 'win7sp1_gdr', 'win7_gdr', 'win7_rtm', 'win7sp1_rtm', 'win7_ldr', 'win7sp1_rc', 
             'rs_xbox', 'rs_onecore_base_cobalt', 'rs_onecore_sigma_grfx_dev', 'rs_onecore_stack_per1', 'rs_shell',
             'other']
newvalues = [i+1 for i in range(len(oldvalues))]

df_replacevalues(full_features, colname, oldvalues, newvalues, 'other')

Previous values ['rs3_release' 'winblue_ltsb_escrow' 'rs2_release' 'rs4_release'
 'rs3_release_svc_escrow' 'rs1_release_inmarket' 'rs1_release'
 'winblue_ltsb' 'th1' 'th2_release_sec' 'win7sp1_ldr' 'th2_release'
 'rs1_release_1' 'rs_prerelease' 'rs3_release_svc_escrow_im' 'rs5_release'
 'rs1_release_sec' 'win7sp1_ldr_escrow' 'rs3_release_svc' 'th1_st1'
 'rs1_release_inmarket_rim' 'th2_release_inmarket'
 'rs1_release_inmarket_aim' 'winblue_r4' 'rs_prerelease_flt' 'winblue_r5'
 'win7sp1_gdr' 'winblue_gdr' 'th1_escrow' 'winblue_r9' 'rs5_release_sigma'
 'win7_rtm' 'winblue_r7' 'winblue_r8' 'win7_gdr' 'win7sp1_rtm'
 'rs1_release_d' 'rs_shell' 'other']
New values [ 9 29 13  7  8 15 14 30 26 23 39 24 16 21 11  2 17 40 10 27 19 25 18 32
 22 37 41 31 28 36  3 43 33 35 42 44 20 51 52]


In [52]:
# Version 1.2.3.4 was converted to columns

# Version_1 = 1
# Version_2 = 2
# Version_3 = 3
# Version_4 = 4
# So the original version columns are not needed any more

versions = ['EngineVersion','AppVersion','AvSigVersion','OsVer','OsBuildLab','Census_OSVersion']

full_features = full_features.drop(versions, axis=1)

In [53]:
# Convert every column that is not already int8/int16/int32 to a numeric dtype,
# replacing NaN/NULL values and values that cannot be parsed as numbers with -1

for colname in full_features.columns:
    if full_features[colname].dtypes.name not in ["int8","int16","int32"]:
        #topvalue = full_features[colname].value_counts().idxmax()
        topvalue = -1
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        column = full_features[colname].fillna(topvalue)
        # errors='coerce' turns values that cannot be parsed into NaN
        column = pd.to_numeric(column, errors='coerce')
        full_features[colname] = column.fillna(topvalue)
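
To illustrate what `errors='coerce'` does in this conversion, here is a sketch on a toy frame with invented values. Missing entries and strings that cannot be parsed both end up as -1. Because the intermediate result contains NaN, the columns in this example come out as floats.

```python
import pandas as pd

# Toy frame: one column with a missing value, one with an unparseable string
toy = pd.DataFrame(
    {
        "OsBuildLab_6": ["1534", "1800", None],
        "SmartScreen": ["4", "abc", "6"],
    }
)

for col in toy.columns:
    # Unparseable strings become NaN, then every NaN is replaced by -1
    toy[col] = pd.to_numeric(toy[col], errors="coerce").fillna(-1)

print(toy)
print(toy.dtypes)
```

If integer columns are required, the result can be cast explicitly with `astype("int64")` once no NaN remains.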
        

In [54]:
full_features.head(10)

MachineIdentifier,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,CORE,EDUCATION,PRO,ENTERPRISE,CLOUD,SERVER,EVALUATION,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
fecbcefeb3c7a3a046263810525a81b5,1,0,7,0,7945,2,1,1,29,107470,48,35,171,10,1,16299,768,503,52,1,0,0,135,4,1,1,2,1,4142,134956,4,5,2373,953869.0,1,930110.0,0,4096.0,1,13.9,1366,768,1,2,126.0,1,23,16299,125,1,4,26,119,1,0,1,1,-1,0,1,-1,142,51777,1,-1,0,0,0,0,0,10,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,1236,0,10,0,0,0,16299,15,1,9,170928,1534,10,0,16299,125
d8df3d0556fa4a826fce204a9432e6dc,1,0,7,0,43747,2,2,1,141,92208,27,167,227,8,1,9600,768,408,52,1,0,0,333,4,1,1,2,1,2206,242491,4,1,289,953869.0,1,924425.0,0,6144.0,1,15.5,1366,768,1,1,0.0,1,12,10586,494,1,3,8,31,1,0,1,1,-1,0,1,0,554,33000,1,0,0,0,0,0,0,10,1,0,0,0,0,0,0,0,1,1,15100,1,4,10,209,0,1,273,1652,0,6,3,0,0,9600,19101,1,29,180718,1800,10,0,10586,494
24d97ff81e15ad492d842119044c0871,1,0,7,0,53447,1,1,1,60,165694,27,274,182,10,1,15063,768,502,52,1,0,0,108,6,1,1,1,1,585,190133,4,5,3327,610480.0,3,297763.0,0,4096.0,2,16.3,1366,768,2,2,-1.0,1,22,15063,1206,1,3,9,34,4,0,1,1,-1,0,1,-1,93,51050,0,-1,0,0,0,0,1,15,1,0,0,0,0,0,0,0,1,1,15100,1,4,18,1807,18075,1,273,642,0,10,0,0,0,15063,0,1,13,170317,1834,10,0,15063,1206
71ce8a1e7833c1b026bf78c7cd2f181d,1,0,7,0,22728,2,1,1,120,120697,-1,144,140,10,1,16299,768,503,52,1,0,0,117,6,1,1,6,1,2668,170943,4,5,2241,244198.0,2,201954.0,0,8192.0,1,13.2,3200,1800,1,1,0.0,1,23,16299,371,1,3,8,31,1,0,1,1,0,0,1,0,628,21399,1,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,1001,0,10,0,0,0,16299,15,1,9,170928,1534,10,0,16299,371
f5299e96739a95dc543df78f43d284b6,1,0,7,0,53447,1,1,1,207,49499,27,277,75,10,1,15063,256,502,55,1,0,0,108,4,1,1,1,1,1980,333856,8,5,2951,228936.0,2,227958.0,0,16384.0,2,27.0,1920,1080,2,2,4294967000.0,1,22,15063,1206,2,3,8,31,4,0,1,1,0,0,1,0,142,35595,0,0,0,0,0,0,1,13,1,0,0,0,0,0,0,0,1,1,15100,1,4,18,1807,18075,1,273,1682,0,10,0,0,0,15063,0,1,13,170317,1834,10,0,15063,1206
7a025c72d473fe85a9799cf9b12a52e7,1,0,7,0,47238,2,1,1,158,36285,48,202,70,10,1,16299,768,503,52,1,0,0,117,6,1,1,2,1,2668,171221,4,5,2373,953869.0,1,906196.0,0,4096.0,1,13.9,1366,768,1,1,0.0,1,23,16299,371,1,3,8,31,3,0,1,2,-1,0,1,0,628,21675,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,15100,1,4,18,1807,18075,1,273,810,0,10,0,0,0,16299,15,1,9,170928,1534,10,0,16299,371
b5062d8c7f2fb750df1be60e22ff2e04,1,0,7,0,3371,2,1,1,120,120697,-1,144,139,10,1,17134,768,504,52,1,0,0,137,6,1,1,2,1,4730,302480,4,5,2574,476940.0,1,199154.0,0,8192.0,1,15.5,1366,768,1,1,188.0,1,24,17134,228,1,1,7,30,1,0,1,2,0,0,1,0,513,9011,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,1,1,14600,4,4,13,17134,228,1,263,48,0,10,0,0,0,17134,1,1,7,180410,1804,10,0,17134,228
2439cf6f16a54e2f1f30e63ef5366733,1,0,0,1,57629,2,1,1,214,146748,27,287,75,10,1,16299,256,503,55,1,0,0,117,4,1,1,2,1,585,318034,4,5,2660,953869.0,1,952728.0,0,4096.0,1,13.9,1920,1080,1,2,0.0,1,23,16299,492,2,6,8,31,2,0,1,1,-1,0,1,-1,556,63317,1,-1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,15100,1,4,18,1806,18062,1,273,483,0,10,0,0,0,16299,431,1,8,180502,1908,10,0,16299,492
d4e5f2acb25087b70b8bcfa1b1ea2b9b,1,0,7,0,49480,2,1,1,95,-1,18,121,168,10,1,14393,768,501,52,1,0,0,94,6,1,1,6,1,585,313257,4,5,3499,476940.0,1,475799.0,0,4096.0,6,11.6,1920,1080,1,2,0.0,1,21,14393,187,1,5,8,42,2,0,1,2,-1,0,1,-1,556,63041,1,-1,0,1,0,0,0,11,0,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,1327,0,10,0,0,0,14393,187,1,15,160906,1818,10,0,14393,187
7e3e226dd18552a7847ba6491ccb1fbb,1,0,7,0,53447,1,1,1,61,923,-1,277,75,10,1,16299,768,503,52,1,0,0,111,4,1,1,2,1,2102,241876,4,5,2412,953869.0,1,953318.0,0,4096.0,1,15.5,1366,768,1,2,0.0,1,23,16299,192,1,2,8,31,2,0,1,2,-1,0,1,-1,554,33135,0,-1,0,1,0,0,0,11,1,0,0,0,0,0,0,0,1,1,15200,1,4,18,1807,18075,1,275,843,0,10,0,0,0,16299,15,1,9,170928,1534,10,0,16299,192


In [55]:
# Let's see some details of the loaded data
full_features.describe()

,ProductName,IsBeta,RtpStateBitfield,IsSxsPassiveMode,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsBuild,OsSuite,OsPlatformSubRelease,SkuEdition,IsProtected,AutoSampleOptIn,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,CORE,EDUCATION,PRO,ENTERPRISE,CLOUD,SERVER,EVALUATION,EngineVersion_1,EngineVersion_2,EngineVersion_3,EngineVersion_4,AppVersion_1,AppVersion_2,AppVersion_3,AppVersion_4,AvSigVersion_1,AvSigVersion_2,AvSigVersion_3,AvSigVersion_4,OsVer_1,OsVer_2,OsVer_3,OsVer_4,OsBuildLab_1,OsBuildLab_2,OsBuildLab_3,OsBuildLab_4,OsBuildLab_5,OsBuildLab_6,Census_OSVersion_1,Census_OSVersion_2,Census_OSVersion_3,Census_OSVersion_4
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,1.01061,5e-06,6.810345,0.01783,47665.80669,1.31751,1.013095,0.988045,107.70644,78297.469535,16.947115,169.42028,122.63257,12.993285,1.181845,15719.255335,575.68686,477.194715,52.61646,0.938135,2.5e-05,-0.060105,125.80388,4.851825,0.958505,32.77753,2.200725,1.00153,2197.936345,236553.66566,3.95746,4.50291,2359.000115,509558.7,1.42363,375215.9,0.07704,6054.010815,1.63381,16.568027,1540.278215,893.343825,1.40031,1.71694,1081537000.0,1.181935,22.08829,15834.23192,970.01053,1.39705,2.942665,14.50583,60.449305,1.882995,0.000545,1.145565,1.52109,-0.829565,-0.01828,1.092145,-0.635145,394.71607,32380.647395,0.48744,-0.63441,0.004875,0.12539,0.038495,0.04933,0.24146,7.580265,0.50111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15075.51593,1.2988,4.0,15.862845,5656.14016,14078.304035,1.0,272.37635,934.442405,0.0,9.86924,0.07704,0.011285,0.00036,15719.15632,1422.684055,1.181845,10.730735,176490.53125,1776.736745,10.0,0.0,15834.23192,970.01053
std,0.102457,0.002236,1.14742,0.132334,14319.682324,0.542908,0.20903,0.108684,62.992782,50305.501322,12.784253,89.437124,69.310755,78.409371,0.574977,2190.5075,247.944153,80.796638,5.881374,0.256394,0.005,0.239463,43.949892,1.270168,0.245038,14214.82,1.315448,0.039085,1326.936207,75940.97394,2.070867,1.34195,856.583242,358564.7,0.625076,324260.3,0.266656,5076.815511,1.556055,5.973556,380.81731,223.543692,0.716264,0.450486,1864259000.0,0.575105,3.534737,1961.084518,2921.853748,0.563101,1.817722,10.253486,45.012456,0.937614,0.023339,0.430925,0.593512,0.376015,0.134,0.378741,0.481567,227.148426,21465.565322,0.499843,0.481596,0.092958,0.331161,0.192388,0.252224,0.500618,4.750952,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,275.185531,1.025585,0.0,3.321983,6265.05682,7389.076715,0.0,5.962642,531.950275,0.0,0.711298,0.451836,0.224205,0.160997,2190.786748,4616.2809,0.574977,6.338539,6056.005388,211.884535,0.0,0.0,1961.084518,2921.853748
min,1.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,-1.0,-1.0,-1.0,1.0,7.0,1.0,7600.0,16.0,201.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,-1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.0,-1.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,-1.0,1.0,0.0,10240.0,0.0,1.0,1.0,-1.0,5.0,1.0,0.0,1.0,1.0,-1.0,-1.0,1.0,-1.0,-1.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,10302.0,0.0,4.0,4.0,204.0,0.0,1.0,167.0,0.0,0.0,6.0,0.0,0.0,0.0,-1.0,-1.0,1.0,2.0,-1.0,-1.0,10.0,0.0,10240.0,0.0
25%,1.0,0.0,7.0,0.0,49480.0,1.0,1.0,1.0,51.0,31368.0,-1.0,89.0,74.0,10.0,1.0,15063.0,256.0,502.0,52.0,1.0,0.0,0.0,108.0,4.0,1.0,1.0,2.0,1.0,1443.0,189550.0,2.0,5.0,1998.0,238475.0,1.0,120569.0,0.0,4096.0,1.0,13.9,1366.0,768.0,1.0,1.0,0.0,1.0,22.0,15063.0,167.0,1.0,1.0,8.0,31.0,1.0,0.0,1.0,1.0,-1.0,0.0,1.0,-1.0,142.0,12463.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15100.0,1.0,4.0,13.0,1807.0,17443.0,1.0,273.0,498.0,0.0,10.0,0.0,0.0,0.0,15063.0,1.0,1.0,7.0,170928.0,1804.0,10.0,0.0,15063.0,167.0
50%,1.0,0.0,7.0,0.0,53447.0,1.0,1.0,1.0,97.0,77866.0,18.0,181.0,88.0,10.0,1.0,16299.0,768.0,503.0,52.0,1.0,0.0,0.0,117.0,4.0,1.0,1.0,2.0,1.0,2102.0,246130.0,4.0,5.0,2498.0,476940.0,1.0,248652.0,0.0,4096.0,1.0,15.5,1366.0,768.0,1.0,2.0,0.0,1.0,23.0,16299.0,285.0,1.0,3.0,9.0,34.0,2.0,0.0,1.0,1.0,-1.0,0.0,1.0,-1.0,486.0,33054.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15100.0,1.0,4.0,18.0,1807.0,18075.0,1.0,273.0,948.0,0.0,10.0,0.0,0.0,0.0,16299.0,1.0,1.0,8.0,180410.0,1804.0,10.0,0.0,16299.0,285.0
75%,1.0,0.0,7.0,0.0,53447.0,2.0,1.0,1.0,160.0,121082.25,27.0,267.0,182.0,10.0,1.0,17134.0,768.0,504.0,55.0,1.0,0.0,0.0,137.0,6.0,1.0,1.0,2.0,1.0,2668.0,303293.0,4.0,5.0,2867.0,953869.0,2.0,475965.0,0.0,8192.0,2.0,17.2,1920.0,1080.0,2.0,2.0,4294967000.0,1.0,24.0,17134.0,547.0,2.0,4.0,20.0,90.0,3.0,0.0,1.0,2.0,-1.0,0.0,1.0,0.0,556.0,52207.25,1.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15200.0,1.0,4.0,18.0,10586.0,18075.0,1.0,275.0,1379.0,0.0,10.0,0.0,0.0,0.0,17134.0,431.0,1.0,13.0,180410.0,1834.0,10.0,0.0,17134.0,547.0
max,2.0,1.0,35.0,1.0,70486.0,6.0,5.0,1.0,222.0,167953.0,52.0,296.0,283.0,2016.0,3.0,18241.0,784.0,505.0,90.0,1.0,1.0,1.0,429.0,6.0,1.0,6357062.0,11.0,2.0,6143.0,345490.0,80.0,10.0,4472.0,14305280.0,3.0,9536968.0,1.0,524288.0,17.0,173.5,8192.0,4500.0,8.0,2.0,4294967000.0,3.0,25.0,18242.0,17976.0,7.0,9.0,39.0,161.0,6.0,1.0,4.0,3.0,0.0,1.0,3.0,1.0,1084.0,72080.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15300.0,6.0,4.0,18.0,17686.0,20082.0,1.0,277.0,4320.0,0.0,10.0,3.0,80.0,72.0,18241.0,24236.0,3.0,52.0,180914.0,2340.0,10.0,0.0,18242.0,17976.0


In [56]:
full_features['UacLuaenable'].unique()

array([      1,       0,      -1,      48, 6357062])
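The values 48 and 6357062 do not look like valid settings for an on/off flag. One option, sketched below on toy data, is to map anything outside an assumed set of valid values to -1, which this notebook appears to use as its missing-value sentinel. The set `VALID_UAC_VALUES` is an assumption for illustration, not something the dataset documentation confirms.

```python
import pandas as pd

# Toy stand-in for full_features['UacLuaenable'].
uac = pd.Series([1, 0, -1, 48, 6357062, 1], name="UacLuaenable")

# Assumption: only 0 and 1 are meaningful; every other value is treated as missing.
VALID_UAC_VALUES = [0, 1]

# Keep valid values, replace the rest with the -1 sentinel.
uac_clean = uac.where(uac.isin(VALID_UAC_VALUES), other=-1)

print(uac_clean.tolist())  # [1, 0, -1, -1, -1, 1]
```

A tree-based model can usually split around a rare extreme value on its own, so this matters more for scaled or linear models.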

In [57]:
full_features.to_csv('regularized.csv')

In [58]:
full_features.dtypes

ProductName                                             int8
IsBeta                                                  int8
RtpStateBitfield                                       int64
IsSxsPassiveMode                                        int8
AVProductStatesIdentifier                              int64
AVProductsInstalled                                    int64
AVProductsEnabled                                      int64
HasTpm                                                  int8
CountryIdentifier                                      int64
CityIdentifier                                         int64
OrganizationIdentifier                                 int64
GeoNameIdentifier                                      int64
LocaleEnglishNameIdentifier                            int64
Platform                                               int16
Processor                                               int8
OsBuild                                                int64
OsSuite                 

In [59]:
# Shuffle the data

shuffle = np.random.permutation(np.arange(full_features.shape[0]))
indexes = full_features.index[shuffle]

full_features = full_features.loc[indexes,:]
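The shuffle above is unseeded, so every run produces a different train/test split and slightly different scores. A minimal sketch of a repeatable shuffle, using `DataFrame.sample` on a toy frame (the seed 123 is arbitrary):

```python
import pandas as pd

# Toy stand-in for full_features.
toy = pd.DataFrame({"x": range(10), "HasDetections": [0, 1] * 5})

# frac=1 returns every row in random order; random_state makes it repeatable.
shuffled_a = toy.sample(frac=1, random_state=123)
shuffled_b = toy.sample(frac=1, random_state=123)

print(shuffled_a.index.equals(shuffled_b.index))  # True
```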

In [60]:
# Keep the label column separately
full_labels = full_features["HasDetections"]

# Drop the label column ("HasDetections") from the feature frame
full_features = full_features.drop(["HasDetections"], axis=1)

In [61]:
# Prepare train and test features and labels (first 80% / last 20% of the shuffled rows)
train_count = int(len(full_features) * 0.8)

train_features = full_features.values[:train_count]
test_features  = full_features.values[train_count:]

train_labels = full_labels.values[:train_count]
test_labels = full_labels.values[train_count:]

In [62]:
train_features.shape

(160000, 104)

In [63]:
test_features.shape

(40000, 104)
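`train_test_split` is imported at the top of the notebook but the split here is done by hand. As a sketch of the alternative, the call below shuffles and splits in one step and, with `stratify`, keeps the class ratio the same in both parts. The data is synthetic and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the HasDetections labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 80/20 split, shuffled, with the label ratio preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

print(X_train.shape, X_test.shape)
```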

In [64]:
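# Standardize features using statistics from the training rows only.
# Note: tree-based models such as HistGradientBoostingClassifier are largely
# insensitive to per-feature scaling, so this step is optional here.
scaler = StandardScaler()
scaler.fit(train_features)
normalized_train_features = scaler.transform(train_features)
normalized_test_features = scaler.transform(test_features)

# Fit on the training rows and report accuracy on the held-out rows.
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(normalized_train_features, train_labels)
all_columns_score = clf.score(normalized_test_features, test_labels)

print("All columns (normalized)", train_features.shape, test_features.shape, train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", all_columns_score*100)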

All columns (normalized) (160000, 104) (40000, 104) (160000,) (40000,) HistGradientBoostingClassifier 64.1


In [65]:
bruteforced_columns = ['ProductName', 'IsBeta', 'RtpStateBitfield', 'IsSxsPassiveMode',
       'AVProductStatesIdentifier', 'AVProductsInstalled', 'AVProductsEnabled',
       'CountryIdentifier', 'CityIdentifier', 'OrganizationIdentifier',
       'GeoNameIdentifier', 'LocaleEnglishNameIdentifier', 'Platform',
       'Processor', 'OsSuite', 'OsPlatformSubRelease', 'SkuEdition',
       'IsProtected', 'AutoSampleOptIn', 'SMode', 'IeVerIdentifier',
       'SmartScreen', 'Firewall', 'UacLuaenable', 'Census_MDC2FormFactor',
       'Census_DeviceFamily', 'Census_OEMNameIdentifier',
       'Census_ProcessorManufacturerIdentifier',
       'Census_ProcessorModelIdentifier', 'Census_PrimaryDiskTotalCapacity',
       'Census_PrimaryDiskTypeName', 'Census_SystemVolumeTotalCapacity',
       'Census_HasOpticalDiskDrive', 'Census_TotalPhysicalRAM',
       'Census_ChassisTypeName',
       'Census_InternalPrimaryDiagonalDisplaySizeInInches',
       'Census_InternalPrimaryDisplayResolutionHorizontal',
       'Census_InternalPrimaryDisplayResolutionVertical',
       'Census_PowerPlatformRoleName', 'Census_InternalBatteryNumberOfCharges',
       'Census_OSArchitecture', 'Census_OSBranch', 'Census_OSBuildNumber',
       'Census_OSBuildRevision', 'Census_OSEdition',
       'Census_OSInstallTypeName', 'Census_OSInstallLanguageIdentifier',
       'Census_OSUILocaleIdentifier', 'Census_OSWUAutoUpdateOptionsName',
       'Census_IsPortableOperatingSystem', 'Census_GenuineStateName',
       'Census_ActivationChannel', 'Census_IsFlightsDisabled',
       'Census_FlightRing', 'Census_ThresholdOptIn',
       'Census_FirmwareManufacturerIdentifier',
       'Census_FirmwareVersionIdentifier', 'Census_IsSecureBootEnabled',
       'Census_IsWIMBootEnabled', 'Census_IsVirtualDevice',
       'Census_IsTouchEnabled', 'Census_IsPenCapable',
       'Census_IsAlwaysOnAlwaysConnectedCapable', 'Wdft_IsGamer',
       'Wdft_RegionIdentifier', 'EngineVersion_1', 'EngineVersion_2',
       'EngineVersion_3', 'EngineVersion_4', 'AppVersion_1', 'AppVersion_2',
       'AppVersion_3', 'AppVersion_4', 'AvSigVersion_1', 'AvSigVersion_2',
       'AvSigVersion_3', 'AvSigVersion_4', 'OsVer_1', 'OsVer_2', 'OsVer_3',
       'OsVer_4', 'OsBuildLab_1', 'OsBuildLab_2', 'OsBuildLab_3',
       'OsBuildLab_4', 'OsBuildLab_5', 'OsBuildLab_6', 'Census_OSVersion_1',
       'Census_OSVersion_2', 'Census_OSVersion_3', 'Census_OSVersion_4',
       'CORE', 'EDUCATION', 'PRO', 'ENTERPRISE', 'CLOUD', 'SERVER', 'EVALUATION']

bruteforced_train_features = full_features[bruteforced_columns].values[:train_count]
bruteforced_test_features  = full_features[bruteforced_columns].values[train_count:]

print("Bruteforced", bruteforced_train_features.shape, bruteforced_test_features.shape, train_labels.shape, test_labels.shape)

Bruteforced (160000, 98) (40000, 98) (160000,) (40000,)


In [66]:
# Run HistGradientBoostingClassifier on Bruteforced training and test data

clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(bruteforced_train_features, train_labels)
bruteforced_columns_score = clf.score(bruteforced_test_features, test_labels)
    
print("Bruteforced", bruteforced_train_features.shape, bruteforced_test_features.shape,
      train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", bruteforced_columns_score*100)

Bruteforced (160000, 98) (40000, 98) (160000,) (40000,) HistGradientBoostingClassifier 64.0725
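The 98-column score (64.0725) and the all-column score (64.1) differ by less than 0.03 percentage points. A rough way to judge whether that gap means anything is to compare it with the binomial standard error of an accuracy estimate on 40,000 test rows. This is only a back-of-the-envelope check: both models were scored on the same rows, so a paired test would be more precise.

```python
import math

n_test = 40_000          # rows in the hold-out set
acc_all = 0.641          # all 104 columns
acc_subset = 0.640725    # 98 "bruteforced" columns

# Standard error of a proportion p estimated from n independent rows.
se = math.sqrt(acc_all * (1 - acc_all) / n_test)
diff = acc_all - acc_subset

print(f"standard error = {se * 100:.2f} points, difference = {diff * 100:.4f} points")
```

The standard error is roughly a quarter of a percentage point, so the two scores are not distinguishable on a single split.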


In [67]:
full_features[bruteforced_columns].dtypes

ProductName                                             int8
IsBeta                                                  int8
RtpStateBitfield                                       int64
IsSxsPassiveMode                                        int8
AVProductStatesIdentifier                              int64
AVProductsInstalled                                    int64
AVProductsEnabled                                      int64
CountryIdentifier                                      int64
CityIdentifier                                         int64
OrganizationIdentifier                                 int64
GeoNameIdentifier                                      int64
LocaleEnglishNameIdentifier                            int64
Platform                                               int16
Processor                                               int8
OsSuite                                                int64
OsPlatformSubRelease                                   int16
SkuEdition              

In [68]:
# Copy the list so that extending it does not also modify bruteforced_columns
engineered_columns = list(bruteforced_columns)

full_features['ScreenProportion'] = full_features['Census_InternalPrimaryDisplayResolutionHorizontal'] / full_features['Census_InternalPrimaryDisplayResolutionVertical']
full_features['ScreenDimensions'] = (full_features['Census_InternalPrimaryDisplayResolutionHorizontal'] * 10000) + full_features['Census_InternalPrimaryDisplayResolutionVertical']
full_features['CapacityDifference'] = full_features['Census_SystemVolumeTotalCapacity'] - full_features['Census_PrimaryDiskTotalCapacity']
full_features['CapacityRatio'] = full_features['Census_SystemVolumeTotalCapacity'] / full_features['Census_PrimaryDiskTotalCapacity']
full_features['RAMByCores'] = full_features['Census_TotalPhysicalRAM'] / full_features['Census_ProcessorCoreCount'] 

full_features['ScreenProportion'] = full_features['ScreenProportion'].replace([np.inf, -np.inf], np.nan).fillna(-1)
full_features['ScreenDimensions'] = full_features['ScreenDimensions'].replace([np.inf, -np.inf], np.nan).fillna(-1)
full_features['CapacityDifference'] = full_features['CapacityDifference'].replace([np.inf, -np.inf], np.nan).fillna(-1)
full_features['CapacityRatio'] = full_features['CapacityRatio'].replace([np.inf, -np.inf], np.nan).fillna(-1)
full_features['RAMByCores'] = full_features['RAMByCores'].replace([np.inf, -np.inf], np.nan).fillna(-1)

engineered_columns.extend(['ScreenProportion', 'ScreenDimensions','CapacityDifference','CapacityRatio','RAMByCores'])

engineered_train_features = full_features[engineered_columns].values[:train_count]
engineered_test_features  = full_features[engineered_columns].values[train_count:]

print("Engineered", engineered_train_features.shape, engineered_test_features.shape, train_labels.shape, test_labels.shape)

Engineered (160000, 103) (40000, 103) (160000,) (40000,)
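Several of the source columns have a minimum of -1 in the summary statistics, which suggests -1 is used as a missing-value sentinel. Dividing such columns directly gives misleading results (for example, -1 / -1 = 1). The helper below, `safe_ratio`, is a hypothetical sketch that returns the sentinel whenever either input is missing or the denominator is zero. The sentinel convention is an assumption inferred from the data.

```python
import pandas as pd

def safe_ratio(numerator, denominator, sentinel=-1):
    """Divide two columns; return `sentinel` where an input is missing or the denominator is 0."""
    num = numerator.astype("float64")
    den = denominator.astype("float64")
    invalid = (num == sentinel) | (den == sentinel) | (den == 0)
    # Masked denominators become NaN, so the ratio is NaN on invalid rows.
    ratio = num / den.mask(invalid)
    return ratio.fillna(sentinel)

# Toy rows: valid, missing cores, missing RAM, zero cores, both missing.
toy = pd.DataFrame({
    "Census_TotalPhysicalRAM":   [8192, 4096, -1, 4096, -1],
    "Census_ProcessorCoreCount": [4,    -1,   2,  0,    -1],
})
toy["RAMByCores"] = safe_ratio(
    toy["Census_TotalPhysicalRAM"], toy["Census_ProcessorCoreCount"]
)

print(toy["RAMByCores"].tolist())  # [2048.0, -1.0, -1.0, -1.0, -1.0]
```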


In [69]:
clf = ske.HistGradientBoostingClassifier(random_state=123)
clf.fit(engineered_train_features, train_labels)
engineered_columns_score = clf.score(engineered_test_features, test_labels)
    
print("Engineered", engineered_train_features.shape, engineered_test_features.shape,
      train_labels.shape, test_labels.shape, "HistGradientBoostingClassifier", engineered_columns_score*100)

Engineered (160000, 103) (40000, 103) (160000,) (40000,) HistGradientBoostingClassifier 64.0075


In [70]:
full_features["HasDetections"] = full_labels
engineered_columns.append("HasDetections")

In [71]:
full_features[engineered_columns].to_csv('bruteforced_engineered.csv')
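By default `to_csv` also writes the DataFrame index, which here is the shuffled row index. When such a file is read back with `read_csv`, the index shows up as an extra `Unnamed: 0` column that can leak into the feature matrix. A small sketch of the difference, using in-memory buffers instead of files:

```python
import io
import pandas as pd

# Toy frame with a non-sequential index, as after shuffling.
toy = pd.DataFrame({"a": [1, 2], "HasDetections": [0, 1]}, index=[7, 3])

with_index = io.StringIO()
toy.to_csv(with_index)                   # default: index written as the first column

without_index = io.StringIO()
toy.to_csv(without_index, index=False)   # index omitted

reloaded_with = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_without = pd.read_csv(io.StringIO(without_index.getvalue()))

print(reloaded_with.columns.tolist())     # ['Unnamed: 0', 'a', 'HasDetections']
print(reloaded_without.columns.tolist())  # ['a', 'HasDetections']
```

If the original row index is needed later, passing `index_col=0` to `read_csv` restores it as the index instead of a feature column.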

In [72]:
full_features[engineered_columns].dtypes

ProductName                                             int8
IsBeta                                                  int8
RtpStateBitfield                                       int64
IsSxsPassiveMode                                        int8
AVProductStatesIdentifier                              int64
AVProductsInstalled                                    int64
AVProductsEnabled                                      int64
CountryIdentifier                                      int64
CityIdentifier                                         int64
OrganizationIdentifier                                 int64
GeoNameIdentifier                                      int64
LocaleEnglishNameIdentifier                            int64
Platform                                               int16
Processor                                               int8
OsSuite                                                int64
OsPlatformSubRelease                                   int16
SkuEdition              