- **PuaMode** - PUA (potentially unwanted application) enabled mode, as reported by the service
- **SMode** - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed
- **IeVerIdentifier** - NA
- **SmartScreen** - This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.
- **Firewall** - This attribute is true (1) for Windows 8.1 and above if Windows Firewall is enabled, as reported by the service.
- **UacLuaenable** - This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.
- **Census_MDC2FormFactor** - A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)
- **Census_DeviceFamily** - AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone
- **Census_OEMNameIdentifier** - NA
- **Census_OEMModelIdentifier** - NA
- **Census_ProcessorCoreCount** - Number of logical cores in the processor
- **Census_ProcessorManufacturerIdentifier** - NA
- **Census_ProcessorModelIdentifier** - NA
- **Census_ProcessorClass** - A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained or updated
- **Census_PrimaryDiskTotalCapacity** - Amount of disk space on primary disk of the machine in MB
- **Census_PrimaryDiskTypeName** - Friendly name of Primary Disk Type - HDD or SSD
- **Census_SystemVolumeTotalCapacity** - The size of the partition that the System volume is installed on in MB
- **Census_HasOpticalDiskDrive** - True indicates that the machine has an optical disk drive (CD/DVD)
- **Census_TotalPhysicalRAM** - Retrieves the physical RAM in MB
- **Census_ChassisTypeName** - Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx
- **Census_InternalPrimaryDiagonalDisplaySizeInInches** - Retrieves the physical diagonal length in inches of the primary display
- **Census_InternalPrimaryDisplayResolutionHorizontal** - Retrieves the number of pixels in the horizontal direction of the internal display.
- **Census_InternalPrimaryDisplayResolutionVertical** - Retrieves the number of pixels in the vertical direction of the internal display
- **Census_PowerPlatformRoleName** - Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device
- **Census_InternalBatteryType** - NA
- **Census_InternalBatteryNumberOfCharges** - NA
- **Census_OSVersion** - Numeric OS version. Example: 10.0.10130.0

In [1]:
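import pandas as pd

def processa_chunk(ms, valores):
    # Accumulate per-column value counts (NaN included) across successive chunks.
    if valores is None:
        # First chunk: start one counts Series per column.
        valores = {v: ms[v].value_counts(dropna=False) for v in ms.columns}
    else:
        for v in ms.columns:
            # Stack old and new counts, then sum the entries that share a value.
            # dropna=False (pandas >= 1.1) keeps the NaN count, which groupby drops by default.
            valores[v] = pd.concat([valores[v], ms[v].value_counts(dropna=False)]).groupby(level=0, dropna=False).sum()
    return valores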

In [2]:
valores = None
ms = pd.read_csv('../../../sample_train.csv', low_memory=False)
nLinhas = ms.shape[0]
valores = processa_chunk(ms,valores)
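The cell above passes the whole file to `processa_chunk` in a single call. A minimal sketch of how the same accumulation could be driven chunk by chunk follows, assuming the file is too large to load at once. The in-memory CSV and the `n_linhas` counter are made up for illustration, and the function is restated so the sketch runs on its own.

```python
import io
import pandas as pd

def processa_chunk(ms, valores):
    """Compact restatement of processa_chunk so this sketch is self-contained."""
    if valores is None:
        return {v: ms[v].value_counts(dropna=False) for v in ms.columns}
    for v in ms.columns:
        valores[v] = (
            pd.concat([valores[v], ms[v].value_counts(dropna=False)])
            .groupby(level=0, dropna=False)  # keep the NaN count (pandas >= 1.1)
            .sum()
        )
    return valores

# Tiny in-memory stand-in for sample_train.csv (made-up rows, illustration only).
csv_text = "SMode,Firewall\n0,1\n0,1\n1,0\n0,1\n1,1\n"

valores = None
n_linhas = 0
# chunksize makes read_csv yield DataFrames of at most 2 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    n_linhas += len(chunk)
    valores = processa_chunk(chunk, valores)

print(n_linhas)
print(valores["SMode"])
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path and `chunksize` by a value that fits in memory.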

### Analysis of the 27 middle attributes, one by one

In [33]:
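from pandas.api.types import is_numeric_dtype

def analisa_coluna(coluna):
    # "Percentagem NAs": percentage of missing values in the column.
    percent_missing = coluna.isnull().sum() * 100 / len(coluna)
    print("Percentagem NAs: " + str(percent_missing))

    # "É numérico?": whether the column has a numeric dtype.
    print("É numérico? " + str(is_numeric_dtype(coluna)))

    # "Valores diferentes": number of distinct values, counting NaN as one.
    nome = coluna.unique()
    print("Valores diferentes: " + str(len(nome)))

    # The distinct values themselves, in order of first appearance.
    print(nome)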
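The same three checks can also be collected for every column at once, which makes the mostly empty columns easier to spot. The sketch below is one possible way to do that, not part of the original notebook. `ms_demo`, `resumo`, and the threshold `LIMIAR_NA` are hypothetical names and values chosen for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame standing in for ms (illustration only).
ms_demo = pd.DataFrame({
    "PuaMode": [np.nan, np.nan, np.nan, "on"],
    "SMode": [0.0, np.nan, 1.0, 0.0],
    "Census_DeviceFamily": ["Windows.Desktop"] * 4,
})

resumo = pd.DataFrame({
    # Share of missing values per column, in percent.
    "pct_na": ms_demo.isna().mean() * 100,
    # Same dtype test that analisa_coluna prints.
    "numerico": [is_numeric_dtype(ms_demo[c]) for c in ms_demo.columns],
    # NaN counts as a distinct value, matching len(coluna.unique()).
    "n_distintos": ms_demo.nunique(dropna=False),
}).sort_values("pct_na", ascending=False)

LIMIAR_NA = 70.0  # hypothetical cut-off, not taken from the notebook
quase_vazias = resumo.index[resumo["pct_na"] > LIMIAR_NA].tolist()

print(resumo)
print(quase_vazias)
```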
    

In [61]:
coluna = ms['PuaMode']

analisa_coluna(coluna)

Percentagem NAs: 99.97716666666666
É numérico? False
Valores diferentes: 2
[nan 'on']


In [62]:
coluna = ms['SMode']

analisa_coluna(coluna)

Percentagem NAs: 6.0155
É numérico? True
Valores diferentes: 3
[ 0. nan  1.]


In [63]:
coluna = ms['IeVerIdentifier']

analisa_coluna(coluna)

Percentagem NAs: 0.6641666666666667
É numérico? True
Valores diferentes: 190
[137.  98. 117. 135. 108. 333. 103. 111.  73.  96.  nan  94. 107.  71.
 176.  76.  53.  81. 323. 280. 114.  74. 105.  41.  64.  84.  87.  78.
  68. 335. 196. 282. 320. 334. 186. 145.  88. 326. 337. 332. 194. 331.
  92.  44.  82.  90.  49. 307.  85. 190. 302.  42.  91. 288.  50. 201.
 364.  45.  72. 327.  86. 152.  46. 325.  66. 315. 312. 180. 205. 178.
 185. 284. 102. 163. 169.  77. 295. 297.  52. 199. 329.  65. 322. 311.
 147.  48. 153. 338. 277. 303. 287. 308. 318. 324. 158. 328. 296. 305.
 309. 162.  62.  63. 154. 275. 224. 383.  47. 347.  61. 316.  51.  21.
 336. 313. 319. 306. 429. 292. 388. 304. 300. 182.   1.  79. 317. 321.
 220.  58. 281. 218.  34. 156.  39. 290.  59.  55. 314.  89. 294. 289.
  60. 395. 151. 298. 174. 348. 406.  57. 381. 171. 384. 428.  16. 168.
 378. 109. 187. 142. 427. 362. 358. 330. 369. 212.   9.  11. 350. 398.
 390. 349. 166. 140. 150. 278.  56. 283. 299. 177. 143. 376. 345. 173.


In [64]:
coluna = ms['SmartScreen']

analisa_coluna(coluna)

Percentagem NAs: 35.611333333333334
É numérico? False
Valores diferentes: 14
[nan 'RequireAdmin' 'ExistsNotSet' 'Warn' 'Off' 'Prompt' 'Block' 'on'
 'off' 'On' '\x01' '\x02' '0' 'OFF']
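The output above contains several spellings of the same label ('Off', 'off', 'OFF', 'on', 'On') plus two control-character values. A minimal cleaning sketch follows, assuming that case differences carry no meaning and that control characters should be treated as missing. Both are assumptions, and `normaliza_smartscreen` is a hypothetical helper. The value '0' is left untouched because its meaning is not stated in the data description.

```python
import pandas as pd

# Made-up sample covering the kinds of spellings seen in the output above.
smartscreen = pd.Series(
    ["RequireAdmin", "ExistsNotSet", "Warn", "Off", "off", "OFF",
     "on", "On", "Prompt", "Block", "0", "\x01", None]
)

def normaliza_smartscreen(coluna):
    """Lowercase the labels and blank out values made only of control characters."""
    limpa = coluna.str.strip().str.lower()
    # na=False keeps the mask strictly boolean when the value is missing.
    so_controlo = limpa.str.fullmatch(r"[\x00-\x1f]*", na=False)
    return limpa.mask(so_controlo)

resultado = normaliza_smartscreen(smartscreen)
print(resultado.value_counts(dropna=False))
```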


In [65]:
coluna = ms['Firewall']

analisa_coluna(coluna)

Percentagem NAs: 1.0311666666666666
É numérico? True
Valores diferentes: 3
[ 1.  0. nan]


In [66]:
coluna = ms['UacLuaenable']

analisa_coluna(coluna)

Percentagem NAs: 0.1175
É numérico? True
Valores diferentes: 8
[1.0000000e+00           nan 0.0000000e+00 4.8000000e+01 4.9000000e+01
 6.3570620e+06 1.6777216e+07 2.0000000e+00]


In [67]:
coluna = ms['Census_MDC2FormFactor']

analisa_coluna(coluna)

Percentagem NAs: 0.0
É numérico? False
Valores diferentes: 12
['Notebook' 'Convertible' 'Desktop' 'AllInOne' 'SmallServer' 'Detachable'
 'LargeTablet' 'SmallTablet' 'PCOther' 'MediumServer' 'LargeServer'
 'ServerOther']


In [68]:
coluna = ms['Census_DeviceFamily']

analisa_coluna(coluna)

Percentagem NAs: 0.0
É numérico? False
Valores diferentes: 3
['Windows.Desktop' 'Windows.Server' 'Windows']


In [69]:
coluna = ms['Census_OEMNameIdentifier']

analisa_coluna(coluna)

Percentagem NAs: 1.0676666666666668
É numérico? True
Valores diferentes: 1665
[2668.  585. 2102. ... 2382.  514. 2429.]


In [70]:
coluna = ms['Census_OEMModelIdentifier']

analisa_coluna(coluna)

Percentagem NAs: 1.1478333333333333
É numérico? True
Valores diferentes: 45095
[ 62683. 171395. 189318. ... 314043. 325966.  33714.]


In [71]:
coluna = ms['Census_ProcessorCoreCount']

analisa_coluna(coluna)

Percentagem NAs: 0.4821666666666667
É numérico? True
Valores diferentes: 29
[ 4.  2.  8.  6. nan 12. 16.  1. 20.  5.  3. 28. 24. 32. 48. 10. 36. 40.
 56. 14. 88.  7. 64. 44. 72. 50. 30. 18. 80.]


In [72]:
coluna = ms['Census_ProcessorManufacturerIdentifier']

analisa_coluna(coluna)

Percentagem NAs: 0.4821666666666667
É numérico? True
Valores diferentes: 5
[ 5.  1. nan 10.  3.]


In [73]:
coluna = ms['Census_ProcessorModelIdentifier']

analisa_coluna(coluna)

Percentagem NAs: 0.4821666666666667
É numérico? True
Valores diferentes: 2332
[2574. 2697. 2011. ... 3689. 3141.  461.]


In [74]:
coluna = ms['Census_ProcessorClass']

analisa_coluna(coluna)

Percentagem NAs: 99.58683333333333
É numérico? False
Valores diferentes: 4
[nan 'mid' 'low' 'high']


In [75]:
coluna = ms['Census_PrimaryDiskTotalCapacity']

analisa_coluna(coluna)

Percentagem NAs: 0.6208333333333333
É numérico? True
Valores diferentes: 1194
[ 476940.  244198.  715404. ... 1812339.  715418.  139488.]


In [76]:
coluna = ms['Census_PrimaryDiskTypeName']

analisa_coluna(coluna)

Percentagem NAs: 0.14833333333333334
É numérico? False
Valores diferentes: 5
['HDD' 'SSD' 'Unspecified' 'UNKNOWN' nan]


In [77]:
coluna = ms['Census_SystemVolumeTotalCapacity']

analisa_coluna(coluna)

Percentagem NAs: 0.6206666666666667
É numérico? True
Valores diferentes: 158821
[ 69500. 127527. 632302. ... 444569. 198452. 146487.]


In [78]:
coluna = ms['Census_HasOpticalDiskDrive']

analisa_coluna(coluna)

Percentagem NAs: 0.0
É numérico? True
Valores diferentes: 2
[0 1]


In [79]:
coluna = ms['Census_TotalPhysicalRAM']

analisa_coluna(coluna)

Percentagem NAs: 0.9355
É numérico? True
Valores diferentes: 627
[4.09600e+03 8.19200e+03 2.04800e+03 6.14400e+03 1.63840e+04         nan
 1.22880e+04 3.07200e+03 1.02400e+03 1.02400e+04 3.27680e+04 3.58400e+03
 5.12000e+03 6.55360e+04 2.04700e+03 4.09500e+03 1.62960e+04 2.56000e+03
 2.45760e+04 7.16800e+03 8.19600e+03 4.19600e+03 2.00800e+03 2.04800e+04
 4.09600e+04 8.19100e+03 1.43360e+04 6.19700e+03 1.53600e+03 5.04400e+03
 8.70400e+03 2.18000e+03 1.63760e+04 4.08500e+03 9.83040e+04 8.17500e+03
 1.63820e+04 8.04400e+03 1.01400e+03 2.86720e+04 3.99900e+03 4.09400e+03
 1.02300e+03 6.01400e+03 5.99900e+03 8.51400e+03 1.63670e+04 8.09200e+03
 3.07100e+03 4.01200e+03 1.74080e+04 1.44700e+03 3.92500e+03 7.85200e+03
 2.94200e+03 4.91520e+04 1.59990e+04 1.31072e+05 8.00000e+03 1.04290e+04
 5.11800e+03 3.58300e+03 3.99600e+03 3.32800e+03 2.15040e+04 2.99900e+03
 3.04700e+03 2.04600e+03 3.00000e+03 3.32600e+03 4.60800e+03 6.14300e+03
  8.07800e+03 3.05400e+03 1.63750e+04 6.00000e+03 ...

In [80]:
coluna = ms['Census_ChassisTypeName']

analisa_coluna(coluna)

Percentagem NAs: 0.0075
É numérico? False
Valores diferentes: 36
['Notebook' 'Desktop' 'Laptop' 'AllinOne' 'Portable' 'Convertible' 'Other'
 'UNKNOWN' 'Detachable' 'MiniTower' 'LowProfileDesktop'
 'MainServerChassis' 'HandHeld' 'Tower' 'SpaceSaving' 'LunchBox' 'Tablet'
 '30' 'MultisystemChassis' 'MiniPC' '32' 'RackMountChassis' 'Unknown' nan
 'SealedCasePC' 'StickPC' 'SubNotebook' 'BusExpansionChassis' 'Blade'
 'PizzaBox' '0' '35' '28' 'ExpansionChassis' '31' '88']


In [81]:
coluna = ms['Census_InternalPrimaryDiagonalDisplaySizeInInches']

analisa_coluna(coluna)

Percentagem NAs: 0.5431666666666667
É numérico? True
Valores diferentes: 509
[ 15.5  13.9  11.6  14.   21.9  23.   12.7  15.3  23.8  21.5  21.7  16.3
  43.   10.1  15.7  24.   18.5  28.8  13.3  20.   15.6  12.3  13.5  17.2
  15.4  15.   34.1  23.5  22.   17.3   nan  17.   13.2  27.    8.8  19.1
   8.   27.2  12.6  19.5  14.1  10.8  23.1  10.3  20.7  19.4  20.1  23.6
  21.6  18.7  52.   18.4  19.   11.5  18.9  46.   17.1  10.   19.7  16.2
  23.4  26.9  12.5  27.3  15.9  20.8  72.3   9.4  17.5  39.8  37.6  32.
  12.1  22.3  22.4  12.    9.9  31.5  23.7  54.6  45.8  20.4  14.5  24.7
  14.7  28.9  19.9  35.   14.9  24.1  24.2  26.1  22.2   6.9  31.7  10.6
  12.2  12.9  13.4  14.6  40.   25.    8.3  17.7  28.6  61.6  10.4  31.4
  23.3  14.8   9.7  20.6  18.6  16.7  10.5  15.2  16.8  23.2  64.5  31.6
  74.9  21.2  65.   24.5  22.9  21.1  16.6   9.   46.1  24.9  11.4  37.
  40.2  12.8  57.5  57.8  45.2  18.1  17.9  29.8  34.8  22.7  50.1  27.9
  39.5  48.6  26.6   9.8  19.3  19.2  31.2  16.  

In [82]:
coluna = ms['Census_InternalPrimaryDisplayResolutionHorizontal']

analisa_coluna(coluna)

Percentagem NAs: 0.5421666666666667
É numérico? True
Valores diferentes: 540
[ 1.366e+03  1.920e+03  1.680e+03  8.000e+02  1.600e+03  7.680e+02
  1.280e+03  1.360e+03  2.560e+03  1.024e+03  2.736e+03  3.000e+03
        nan  2.160e+03  1.440e+03  3.200e+03  1.080e+03  3.840e+03
  2.256e+03  1.144e+03  1.227e+03  3.440e+03  6.000e+02  6.400e+02
  1.824e+03  1.368e+03  1.536e+03  1.400e+03  1.786e+03  2.304e+03
  2.880e+03  1.502e+03  1.152e+03  1.882e+03  2.048e+03  1.776e+03
  1.692e+03  1.462e+03  9.000e+02  1.200e+03  1.768e+03  4.096e+03
  1.630e+03  3.360e+03  8.240e+02  1.625e+03  1.364e+03  1.800e+03
  1.050e+03  5.755e+03  3.814e+03  1.894e+03  2.197e+03 -1.000e+00
  1.303e+03  1.745e+03  9.820e+02  5.120e+03  1.240e+03  9.600e+03
  1.708e+03  1.064e+03  4.500e+03  1.495e+03  5.760e+03  1.594e+03
  1.922e+03  1.842e+03  1.301e+03  2.166e+03  3.240e+03  3.246e+03
  1.873e+03  1.262e+03  1.675e+03  1.079e+03  2.754e+03  1.299e+03
  1.489e+03  4.320e+03  1.862e+03  1.716e+03 ...

In [83]:
coluna = ms['Census_InternalPrimaryDisplayResolutionVertical']

analisa_coluna(coluna)

Percentagem NAs: 0.5421666666666667
É numérico? True
Valores diferentes: 569
[ 7.680e+02  1.080e+03  1.050e+03  6.000e+02  9.000e+02  1.366e+03
  8.000e+02  7.200e+02  1.824e+03  2.000e+03  1.024e+03        nan
  1.280e+03  1.440e+03  1.200e+03  1.800e+03  1.920e+03  9.520e+02
  9.750e+02  2.160e+03  1.504e+03  7.970e+02  1.600e+03  8.470e+02
  4.800e+02  1.026e+03  2.048e+03  9.600e+02  1.004e+03  2.736e+03
  9.450e+02  1.106e+03  8.640e+02  1.058e+03  1.536e+03  1.000e+03
  1.014e+03  5.760e+02  6.750e+02  9.760e+02  9.920e+02  2.304e+03
  9.020e+02  1.152e+03  6.480e+02  2.100e+03  6.420e+02  9.570e+02
  8.030e+02  7.240e+02  1.680e+03  1.001e+03  2.021e+03  2.400e+03
  9.430e+02  8.560e+02  1.137e+03 -1.000e+00  1.081e+03  7.470e+02
  3.900e+02  9.630e+02  8.820e+02  5.250e+02  2.880e+03  4.000e+02
  3.000e+03  7.310e+02  1.135e+03  1.036e+03  6.550e+02  7.190e+02
  1.269e+03  6.200e+02  1.338e+03  6.810e+02  6.640e+02  1.011e+03
  7.760e+02  9.580e+02  5.820e+02  1.570e+03 ...
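Both resolution columns contain the value -1, which cannot be a real pixel count. The sketch below treats -1 as missing (an assumption) and then combines the two resolution columns with the diagonal size into a pixel density, using diagonal pixels = sqrt(h² + v²) divided by the diagonal in inches. The derived column `ppi` and the frame `ecra` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

diag = "Census_InternalPrimaryDiagonalDisplaySizeInInches"
horiz = "Census_InternalPrimaryDisplayResolutionHorizontal"
vert = "Census_InternalPrimaryDisplayResolutionVertical"

# Made-up rows for illustration only.
ecra = pd.DataFrame({
    diag: [15.5, 13.9, 21.5, np.nan],
    horiz: [1366.0, 1920.0, -1.0, 1280.0],
    vert: [768.0, 1080.0, -1.0, 720.0],
})

# Non-positive resolutions are treated as missing (assumption about the -1 values).
res_cols = [horiz, vert]
ecra[res_cols] = ecra[res_cols].where(ecra[res_cols] > 0)

# Pixels along the diagonal, divided by the diagonal length in inches.
ecra["ppi"] = np.hypot(ecra[horiz], ecra[vert]) / ecra[diag]

print(ecra["ppi"])
```

Rows where either the resolution or the diagonal is missing end up with a missing `ppi`.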

In [84]:
coluna = ms['Census_PowerPlatformRoleName']

analisa_coluna(coluna)

Percentagem NAs: 0.0008333333333333334
É numérico? False
Valores diferentes: 10
['Mobile' 'Desktop' 'Slate' 'Workstation' 'UNKNOWN' 'SOHOServer'
 'EnterpriseServer' 'AppliancePC' nan 'PerformanceServer']


In [85]:
coluna = ms['Census_InternalBatteryType']

analisa_coluna(coluna)

Percentagem NAs: 71.05283333333334
É numérico? False
Valores diferentes: 31
[nan 'lion' 'li-i' 'lip' '#' 'liio' 'real' 'li p' 'li' 'bq20' 'nimh'
 'lgi0' 'batt' 'pbac' '4cel' 'vbox' 'unkn' 'ithi' 'lipo' 'pad0' 'lhp0'
 'a132' 'lio' 'virt' 'bad' 'lit' '4lio' 'lipp' 'ram' 'ca48' 'p-sn']


In [86]:
coluna = ms['Census_InternalBatteryNumberOfCharges']

analisa_coluna(coluna)

Percentagem NAs: 3.0425
É numérico? True
Valores diferentes: 5948
[0.0000000e+00 4.2949673e+09 8.0000000e+00 ... 5.4397000e+04 3.3000000e+03
 1.4664000e+04]
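The value 4.2949673e+09 in the output above looks consistent with 2**32 - 1, the largest unsigned 32-bit integer, which is often used as a "not available" marker. That reading is an assumption, since the data description lists this field as NA. A minimal sketch for counting and blanking such values, with made-up data and hypothetical names:

```python
import numpy as np
import pandas as pd

UINT32_MAX = 2 ** 32 - 1  # 4294967295

# Made-up values for illustration only.
cargas = pd.Series([0.0, 4294967295.0, 8.0, 54397.0, np.nan])

# >= rather than == so that a float32-rounded 2**32 is caught as well.
e_sentinela = cargas.ge(UINT32_MAX)
cargas_limpas = cargas.mask(e_sentinela)

print(int(e_sentinela.sum()))
print(cargas_limpas.max())
```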


In [87]:
coluna = ms['Census_OSVersion']

analisa_coluna(coluna)

Percentagem NAs: 0.0
É numérico? False
Valores diferentes: 305
['10.0.17134.165' '10.0.14393.2189' '10.0.16299.431' '10.0.16299.125'
 '10.0.17134.112' '10.0.17134.1' '10.0.15063.1155' '10.0.16299.547'
 '10.0.17134.254' '10.0.10586.494' '10.0.17134.228' '10.0.17134.285'
 '10.0.17134.137' '10.0.14393.953' '10.0.16299.248' '10.0.15063.483'
 '10.0.15063.1088' '10.0.15063.1206' '10.0.10586.1045' '10.0.17134.191'
 '10.0.15063.1266' '10.0.14393.1944' '10.0.16299.492' '10.0.15063.966'
 '10.0.14393.351' '10.0.15063.909' '10.0.16299.611' '10.0.15063.786'
 '10.0.15063.1324' '10.0.16299.192' '10.0.16299.371' '10.0.14393.2125'
 '10.0.15063.632' '10.0.10586.0' '10.0.17733.1000' '10.0.14393.447'
 '10.0.16299.64' '10.0.15063.608' '10.0.10586.218' '10.0.17134.286'
 '10.0.16299.334' '10.0.10240.17443' '10.0.16299.15' '10.0.16299.522'
 '10.0.14393.2007' '10.0.10586.164' '10.0.16299.665' '10.0.10586.753'
 '10.0.17134.48' '10.0.16299.309' '10.0.10586.1176' '10.0.17134.167'
 '10.0.10586.1106' ...
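With 305 distinct strings, `Census_OSVersion` may be easier to work with once split into its four dot-separated numbers. The sketch below shows one way to do that. The column names `major`, `minor`, `build`, and `revision` are labels chosen here for illustration and are not taken from the data description.

```python
import pandas as pd

# A few version strings copied from the output above.
os_version = pd.Series(["10.0.17134.165", "10.0.14393.2189", "10.0.10240.17443"])

# A single-character separator is treated as a literal by str.split.
partes = os_version.str.split(".", expand=True).astype(int)
partes.columns = ["major", "minor", "build", "revision"]

print(partes)
print(partes["build"].nunique())
```

The integer columns sort numerically, so build 10240 orders before 14393, whereas the raw strings only compare character by character.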