# Microsoft Malware Challenge

## Baseline model: SVM

## For CSCE 633 Machine Learning, Spring 2019, Course project
### Team: MARTHA
### Author: Rose Lin

This is the second notebook in the series. In the *EDA1* notebook we took a quick look at all the variables. Here we build a baseline model (SVM) and then look for opportunities to improve it.

~Ongoing efforts, your feedback is appreciated!~

### RECAP

In EDA 1, we concluded that the following 66 columns (65 features plus the `HasDetections` target) can be kept for our model:

```
['ProductName' 'EngineVersion' 'AppVersion' 'AvSigVersion' 'IsBeta'
 'RtpStateBitfield' 'AVProductStatesIdentifier' 'AVProductsInstalled'
 'AVProductsEnabled' 'HasTpm' 'CountryIdentifier' 'CityIdentifier'
 'OrganizationIdentifier' 'GeoNameIdentifier'
 'LocaleEnglishNameIdentifier' 'Processor' 'OsVer' 'OsBuild' 'OsSuite'
 'OsPlatformSubRelease' 'OsBuildLab' 'IsProtected' 'AutoSampleOptIn'
 'SMode' 'IeVerIdentifier' 'SmartScreen' 'Firewall' 'UacLuaenable'
 'Census_MDC2FormFactor' 'Census_DeviceFamily' 'Census_OEMNameIdentifier'
 'Census_OEMModelIdentifier' 'Census_ProcessorCoreCount'
 'Census_ProcessorModelIdentifier' 'Census_PrimaryDiskTotalCapacity'
 'Census_PrimaryDiskTypeName' 'Census_SystemVolumeTotalCapacity'
 'Census_HasOpticalDiskDrive' 'Census_TotalPhysicalRAM'
 'Census_ChassisTypeName'
 'Census_InternalPrimaryDiagonalDisplaySizeInInches'
 'Census_InternalPrimaryDisplayResolutionHorizontal'
 'Census_PowerPlatformRoleName' 'Census_InternalBatteryNumberOfCharges'
 'Census_OSVersion' 'Census_OSBranch' 'Census_OSBuildRevision'
 'Census_OSEdition' 'Census_OSInstallTypeName'
 'Census_OSInstallLanguageIdentifier' 'Census_OSWUAutoUpdateOptionsName'
 'Census_IsPortableOperatingSystem' 'Census_GenuineStateName'
 'Census_ActivationChannel' 'Census_IsFlightsDisabled' 'Census_FlightRing'
 'Census_FirmwareManufacturerIdentifier'
 'Census_FirmwareVersionIdentifier' 'Census_IsSecureBootEnabled'
 'Census_IsVirtualDevice' 'Census_IsTouchEnabled' 'Census_IsPenCapable'
 'Census_IsAlwaysOnAlwaysConnectedCapable' 'Wdft_IsGamer'
 'Wdft_RegionIdentifier' 'HasDetections']
```

But is this correct? Can we keep more data?

This notebook builds a baseline SVM model and checks its performance.

In [2]:
# load the data from Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from scipy.sparse import vstack, csr_matrix, save_npz, load_npz
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import gc
gc.enable()

In [0]:
pd.set_option('display.max_columns', 100)

In [0]:
# from https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int16',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float64', # was 'float32'
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float32', # was 'float16'
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float32', # was 'float16'
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float64', # was 'float32'
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float64', # was 'float32'
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32', # was 'float16'
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32', # was 'float16'
        'Census_InternalPrimaryDisplayResolutionVertical':      'float32', # was 'float16'
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float64', # was 'float32'
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }

In [6]:
train = pd.read_csv('/content/gdrive/My Drive/Coding experiment/MARTHA/data/train.csv', dtype=dtypes, low_memory=True)
train['MachineIdentifier'] = train.index.astype('uint32')
test  = pd.read_csv('/content/gdrive/My Drive/Coding experiment/MARTHA/data/test.csv',  dtype=dtypes, low_memory=True)
test['MachineIdentifier']  = test.index.astype('uint32')
gc.collect()

201012

In [0]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8921483 entries, 0 to 8921482
Data columns (total 83 columns):
MachineIdentifier                                    uint64
ProductName                                          category
EngineVersion                                        category
AppVersion                                           category
AvSigVersion                                         category
IsBeta                                               int8
RtpStateBitfield                                     float16
IsSxsPassiveMode                                     int8
DefaultBrowsersIdentifier                            float32
AVProductStatesIdentifier                            float32
AVProductsInstalled                                  float16
AVProductsEnabled                                    float16
HasTpm                                               int8
CountryIdentifier                                    int16
CityIdentifier                           

In [0]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7853253 entries, 0 to 7853252
Data columns (total 82 columns):
MachineIdentifier                                    uint64
ProductName                                          category
EngineVersion                                        category
AppVersion                                           category
AvSigVersion                                         category
IsBeta                                               int8
RtpStateBitfield                                     float16
IsSxsPassiveMode                                     int8
DefaultBrowsersIdentifier                            float32
AVProductStatesIdentifier                            float32
AVProductsInstalled                                  float16
AVProductsEnabled                                    float16
HasTpm                                               int8
CountryIdentifier                                    int16
CityIdentifier                           

In [0]:
# take a look at the head
train.head()

Unnamed: 0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_ProcessorClass,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,0,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1735.0,0,7.0,0,,53447.0,1.0,1.0,1,29,128035.0,18.0,35.0,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,Desktop,Windows.Desktop,2668.0,9124.0,4.0,5.0,2341.0,,476940.0,HDD,299451.0,0,4096.0,Desktop,18.9,1440.0,900.0,Desktop,,4294967000.0,10.0.17134.165,amd64,rs4_release,17134,165,Professional,PROFESSIONAL,UUPUpgrade,26.0,119,UNKNOWN,0,IS_GENUINE,Retail,,0.0,Retail,,628.0,36144.0,0,,0.0,0,0,0.0,0.0,10.0,0
1,1,win8defender,1.1.14600.4,4.13.17134.1,1.263.48.0,0,7.0,0,,53447.0,1.0,1.0,1,93,1482.0,18.0,119.0,64,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,Notebook,Windows.Desktop,2668.0,91656.0,4.0,5.0,2405.0,,476940.0,HDD,102385.0,0,4096.0,Notebook,13.9,1366.0,768.0,Mobile,,1.0,10.0.17134.1,amd64,rs4_release,17134,1,Professional,PROFESSIONAL,IBSClean,8.0,31,UNKNOWN,0,OFFLINE,Retail,,0.0,NOT_SET,,628.0,57858.0,0,,0.0,0,0,0.0,0.0,8.0,0
2,2,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1341.0,0,7.0,0,,53447.0,1.0,1.0,1,86,153579.0,18.0,64.0,49,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,Desktop,Windows.Desktop,4909.0,317701.0,4.0,5.0,1972.0,,114473.0,SSD,113907.0,0,4096.0,Desktop,21.5,1920.0,1080.0,Desktop,,4294967000.0,10.0.17134.165,amd64,rs4_release,17134,165,Core,CORE,UUPUpgrade,7.0,30,FullAuto,0,IS_GENUINE,OEM:NONSLP,,0.0,Retail,,142.0,52682.0,0,,0.0,0,0,0.0,0.0,3.0,0
3,3,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1527.0,0,7.0,0,,53447.0,1.0,1.0,1,88,20710.0,,117.0,115,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,Desktop,Windows.Desktop,1443.0,275890.0,4.0,5.0,2273.0,,238475.0,UNKNOWN,227116.0,0,4096.0,MiniTower,18.5,1366.0,768.0,Desktop,,4294967000.0,10.0.17134.228,amd64,rs4_release,17134,228,Professional,PROFESSIONAL,UUPUpgrade,17.0,64,FullAuto,0,IS_GENUINE,OEM:NONSLP,,0.0,Retail,,355.0,20050.0,0,,0.0,0,0,0.0,0.0,3.0,1
4,4,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1379.0,0,7.0,0,,53447.0,1.0,1.0,1,18,37376.0,,277.0,75,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,Notebook,Windows.Desktop,1443.0,331929.0,4.0,5.0,2500.0,,476940.0,HDD,101900.0,0,6144.0,Portable,14.0,1366.0,768.0,Mobile,lion,0.0,10.0.17134.191,amd64,rs4_release,17134,191,Core,CORE,Update,8.0,31,FullAuto,0,IS_GENUINE,Retail,0.0,0.0,Retail,0.0,355.0,19844.0,0,0.0,0.0,0,0,0.0,0.0,1.0,1


In [0]:
train.shape

(8921483, 83)

In [0]:
test.shape

(7853253, 82)

### Deal with missing values

In order for the `scikit-learn` implementation of SVM to work, we will have to fill all the missing values.

Note: LightGBM and XGBoost would not need explicit feature selection, because tree-based boosting effectively selects features through its splits. Both libraries also handle NaN values natively.

In [0]:
# Nan Values
null_counts = train.isnull().sum()/train.shape[0]
#print(null_counts)
print("Columns with at least 1 NA:")
for index in null_counts[null_counts > 0].index:
  print("name:", index, "types:", train[index].dtypes, '% null:', null_counts[index])

Columns with at least 1 NA:
name: RtpStateBitfield types: float16 % null: 0.003622491910817966
name: DefaultBrowsersIdentifier types: float32 % null: 0.9514163732644001
name: AVProductStatesIdentifier types: float32 % null: 0.004059975230575455
name: AVProductsInstalled types: float16 % null: 0.004059975230575455
name: AVProductsEnabled types: float16 % null: 0.004059975230575455
name: CityIdentifier types: float32 % null: 0.0364747654621995
name: OrganizationIdentifier types: float16 % null: 0.3084148677972037
name: GeoNameIdentifier types: float16 % null: 2.3874954421815298e-05
name: OsBuildLab types: category % null: 2.3538687458127756e-06
name: IsProtected types: float16 % null: 0.00404013547971789
name: PuaMode types: category % null: 0.9997411865269485
name: SMode types: float16 % null: 0.0602768620418825
name: IeVerIdentifier types: float16 % null: 0.006601368853137982
name: SmartScreen types: category % null: 0.35610794752397107
name: Firewall types: float16 % null: 0.010239329

So we have 44 features with NaN values. I tried dropping those features, but that leaves us with too few features. Dropping every row that contains at least one NaN leaves less than half of the records. Neither option is ideal.
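To make the trade-off concrete, here is a minimal sketch on a toy frame (the column names and values are hypothetical, not taken from the competition data) that compares dropping columns with dropping rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' is complete, 'b' and 'c' contain NaNs.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, 2, np.nan, 4],
    "c": [1, np.nan, 3, 4],
})

null_frac = toy.isnull().mean()          # fraction of NaNs per column
cols_left = toy.dropna(axis=1).shape[1]  # drop every column that has a NaN
rows_left = toy.dropna(axis=0).shape[0]  # drop every row that has a NaN

print(null_frac)
print("columns left:", cols_left, "rows left:", rows_left)
```

Even on this tiny example, both strategies throw away most of the data, which is the same effect we see on the real training set.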

Inspired by [this notebook](https://www.kaggle.com/bogorodvo/lightgbm-baseline-model-using-sparse-matrix), I think we should try converting all features to a categorical encoding with an extra "NA" category and then fit the SVM, which preserves more of the data.

* Note: this uses label encoding only. <u>One possibility is to try different types of encoding depending on the type of the data.</u> [source](https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm)
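As a rough illustration of the idea, the sketch below label-encodes a toy column and gives missing values their own "NA" category. The toy values are made up, and the explicit "NA" string is an assumption for clarity; the notebook's own loop relies on the string cast instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with missing values.
toy = pd.Series([7.0, np.nan, 3.0, 7.0, np.nan], name="RtpStateBitfield")

# Turn every value into a string label; missing values get the label "NA".
labels = [str(v) if pd.notna(v) else "NA" for v in toy]

le = LabelEncoder().fit(labels)
# Shift by 1 so that code 0 stays free for values that are dropped later.
codes = le.transform(labels) + 1

print(list(le.classes_))
print(codes)
```

Note that the integer codes are arbitrary identifiers: their order and magnitude carry no meaning, which matters when they are fed to a linear model.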

In [7]:
print('Transform all features to category.\n')
count = 0
for usecol in train.columns.tolist()[1:-1]:

    if count % 5 == 0:
        print("Processed",count,"features.")
      
    train[usecol] = train[usecol].astype('str')
    test[usecol] = test[usecol].astype('str')
    
    #Fit LabelEncoder
    le = LabelEncoder().fit(
            np.unique(train[usecol].unique().tolist()+
                      test[usecol].unique().tolist()))

    #At the end 0 will be used for dropped values
    train[usecol] = le.transform(train[usecol])+1
    test[usecol]  = le.transform(test[usecol])+1

    agg_tr = (train
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Train'}, axis=1))
    agg_te = (test
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Test'}, axis=1))

    agg = pd.merge(agg_tr, agg_te, on=usecol, how='outer').replace(np.nan, 0)
    #Select values with more than 1000 observations
    agg = agg[(agg['Train'] > 1000)].reset_index(drop=True)
    agg['Total'] = agg['Train'] + agg['Test']
    #Drop unbalanced values
    agg = agg[(agg['Train'] / agg['Total'] > 0.2) & (agg['Train'] / agg['Total'] < 0.8)]
    agg[usecol+'Copy'] = agg[usecol]

    train[usecol] = (pd.merge(train[[usecol]], 
                              agg[[usecol, usecol+'Copy']], 
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))

    test[usecol]  = (pd.merge(test[[usecol]], 
                              agg[[usecol, usecol+'Copy']], 
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))

    del le, agg_tr, agg_te, agg, usecol
    gc.collect()
    count += 1

Transform all features to category.

Processed 0 features.
Processed 5 features.
Processed 10 features.
Processed 15 features.
Processed 20 features.
Processed 25 features.
Processed 30 features.
Processed 35 features.
Processed 40 features.
Processed 45 features.
Processed 50 features.
Processed 55 features.
Processed 60 features.
Processed 65 features.
Processed 70 features.
Processed 75 features.
Processed 80 features.
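The filtering inside the loop above keeps a value only if it appears often enough in the training set and has a similar share in train and test; everything else is mapped to 0. A minimal sketch of that rule on toy data follows. The helper name `keep_balanced_values` and the small `min_train` threshold are assumptions for illustration (the notebook uses 1000).

```python
import pandas as pd

def keep_balanced_values(train_col, test_col, min_train=2, low=0.2, high=0.8):
    """Return the values that are frequent in train and balanced between train and test."""
    counts = pd.DataFrame({
        "Train": train_col.value_counts(),
        "Test": test_col.value_counts(),
    }).fillna(0)
    counts["Total"] = counts["Train"] + counts["Test"]
    share = counts["Train"] / counts["Total"]
    mask = (counts["Train"] > min_train) & (share > low) & (share < high)
    return set(counts.index[mask])

# Hypothetical encoded columns.
train_col = pd.Series([1, 1, 1, 2, 2, 2, 2, 2, 2, 3])
test_col = pd.Series([1, 1, 1, 2, 4, 4, 4])

kept = keep_balanced_values(train_col, test_col)
# Values that fail the filter are replaced by 0, the "dropped" code.
encoded = train_col.where(train_col.isin(kept), 0)
print(kept, encoded.tolist())
```

Here value 2 is frequent but over-represented in train, value 3 is too rare, and value 4 never occurs in train, so only value 1 survives.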


Take a look at the result. We can see that all columns have been encoded as categories.

In [0]:
train.sample(n=2)

Unnamed: 0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_ProcessorClass,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
1326058,1326058,5,0,0,0,1,7,1,2610,26978,2,2,2,67,128257,10,246,218,1,3,1,5,4,4,291,7,2,1,2,0,17,10,2,2,11,2,0,0,31,6,828,4,6871,2,0,2,1515,43,77,930,82,7,72,0,0,3,13,122,0,6,4,7,31,61,3,1,2,3,3,1,8,3,433,0,2,3,1,1,1,2,3,16,0
8350527,8350527,5,0,23,0,1,7,1,2610,26978,2,2,2,48,126479,19,76,142,1,2,1,5,13,4,293,5,2,1,2,0,17,21,2,2,9,2,1445,0,17,1,0,4,8637,1,0,1,2347,43,89,374,1576,4,72,1,364,1,11,102,242,8,6,6,39,85,6,1,2,3,3,1,8,3,544,0,2,3,1,1,1,1,3,16,0


In [0]:
y_target = np.array(train['HasDetections'])
y_target_orig = train['HasDetections']
train_ids = train.index
test_ids  = test.index

del train['HasDetections'], train['MachineIdentifier'], test['MachineIdentifier']
gc.collect()

21

In [0]:
y_target.shape

(8921483,)

In [0]:
train.shape

(7434570,)

We use a dense matrix to fit the SVM for now and will try a sparse matrix later.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train, y_target, test_size=0.33, random_state=42)

Unfortunately, `LinearSVC` never finishes on Colab, even though we have about 12 GB of RAM here. The scikit-learn implementation needs to see all the data at once, and our dataset is very large. So instead of a sparse representation, I will try an online (incremental) learning algorithm. (See the scikit-learn documentation [here](https://scikit-learn.org/0.15/modules/scaling_strategies.html))

**IMPORTANT!** `SGDClassifier` is a linear model; with its default hinge loss it fits a linear SVM. So we have imposed a linearity assumption on the data (i.e. the relationship between the independent and dependent variables is linear). This may or may not hold. Linear models also come with a set of [assumptions](https://www.statisticssolutions.com/assumptions-of-linear-regression/) that we would need to satisfy. For now we simply pass all the data through SGD to create a naive model.
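For reference, here is a minimal sketch of incremental training with `partial_fit`. The data is synthetic and linearly separable by construction (an assumption made only so the example is self-contained), so it says nothing about how well a linear model suits the malware data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# Synthetic toy data whose label depends linearly on two features.
X = rng.randn(2000, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hinge loss gives the linear SVM objective.
clf = SGDClassifier(loss="hinge", random_state=0)

batch_size = 100
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    # classes is required on the first call; passing it every time is harmless.
    clf.partial_fit(X[batch], y[batch], classes=np.array([0, 1]))

acc = clf.score(X, y)
print("training accuracy:", acc)
```

Only one batch is passed to the model at a time, which is what makes this approach usable when the whole dataset cannot be handed to a single `fit` call.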

In [0]:
# DON'T TRY TO RUN THIS! IT WON'T WORK!
# from sklearn.svm import LinearSVC

# clf = LinearSVC(random_state=0, tol=1e-5)
# clf.fit(X_train, y_train)

In [0]:
# clf.score(X_test, y_test)

In [0]:
# Trying SGD Classifier below
from sklearn.linear_model import SGDClassifier
import random

clf = SGDClassifier()

batch_size = 1000
# number of full batches, and the size of the final partial batch
num_of_iter, residual = divmod(len(X_train), batch_size)
if residual:
  num_of_iter += 1  # one extra iteration for the partial batch

In [0]:
print("num_of_iter:",num_of_iter,"residual:",residual)

num_of_iter: 5978 residual: 393


In [0]:
for i in range(num_of_iter):
  if i != num_of_iter - 1:
    subset = slice(batch_size * i, batch_size * (i + 1))
    #slice_indexes = [i for i in range(batch_size * i,batch_size * (i + 1))]
  else:
    subset = slice(batch_size * i, len(X_train))
    #slice_indexes = [i for i in range(batch_size * i, len(X_train))]
  #random.shuffle(slice_indexes)
  X_sub = X_train[subset]
  y_sub = y_train[subset]
#   X_sub = X_train.loc[slice_indexes]
#   y_sub = y_train[slice_indexes]
  if i % batch_size == 0:
    print("Finished %s iterations." % i)
  clf.partial_fit(X_sub, y_sub, classes=np.unique(y_train))

Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.
Finished 3000 iterations.
Finished 4000 iterations.
Finished 5000 iterations.


In [0]:
# predict on unseen data
# a small test below
y_pred = clf.predict(X_test[:10])
y_pred

array([1, 1, 1, 0, 0, 1, 1, 0, 1, 0], dtype=int8)

In [0]:
y_test[:10]

array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1], dtype=int8)

In [0]:
# Persist model
from joblib import dump, load

dump(clf, '/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGDClassifier.joblib')

['/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGDClassifier.joblib']

In [0]:
# Getting an accuracy measurement
clf.score(X_test, y_test)

0.5159889813151093

In [0]:
# Classification report (per-class precision, recall and F1)
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.52      0.51      0.51   1473693
           1       0.52      0.53      0.52   1470397

   micro avg       0.52      0.52      0.52   2944090
   macro avg       0.52      0.52      0.52   2944090
weighted avg       0.52      0.52      0.52   2944090



In [0]:
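# ROC AUC
# Note: y_pred holds hard 0/1 labels, so this value equals balanced accuracy.
# Passing clf.decision_function(X_test) instead would score the full ranking.
print('SGD (Raw) Testing set AUC Score: {}'.format(roc_auc_score(y_test, y_pred)))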

SGD (Raw) Testing set AUC Score: 0.516000363652941
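A caveat on the AUC above: `roc_auc_score` is meant to receive continuous scores. When it receives hard 0/1 predictions, the result collapses to balanced accuracy. The toy sketch below (made-up labels and scores) shows the difference; on a fitted `SGDClassifier` the scores would come from `decision_function`.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
# Made-up continuous scores, standing in for decision_function output.
scores = np.array([-2.0, -0.5, 0.3, 1.5, 0.4, -0.1])
# Hard predictions obtained by thresholding the scores at 0.
labels = (scores > 0).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
bal_acc = balanced_accuracy_score(y_true, labels)

print(auc_from_scores, auc_from_labels, bal_acc)
```

The score-based AUC uses the whole ranking of the samples, while the label-based value only reflects a single threshold.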


Looks like SGD is quick, but the accuracy and precision suffer: at about 52% they are barely better than chance. This can serve as our baseline model.

Tried 3 versions of SGD with different batch sizes:

* Batch size = 2000: saved model, accuracy ~ 50%
* Batch size = 1000: accuracy ~ 51.6%
* Batch size = 500: accuracy ~ 49.52%

How can we bump up the accuracy even more using SGD?

In [0]:
# Trying to use K-Fold here
num_of_split = 6
skf = StratifiedKFold(n_splits=num_of_split, shuffle=True, random_state=42)
skf.get_n_splits(train.index, y_target_orig)

6

In [0]:
# Trying SGD Classifier below
from sklearn.linear_model import SGDClassifier
import random

clf = SGDClassifier(alpha=1e-05)

In [0]:
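index = 1
# Note: clf was created once in the previous cell, so it keeps learning across folds
# and later folds are validated on rows the model has already trained on.
# Re-create the classifier inside this loop to get independent per-fold scores.
for train_index, test_index in skf.split(train.index, y_target_orig):
    # skf.split yields positional indices, so select rows by position
    X_fit, X_val = train.iloc[train_index], train.iloc[test_index]
    y_fit, y_val = y_target_orig.iloc[train_index], y_target_orig.iloc[test_index]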

    batch_size = 1000
    residual = 0
    num_of_iter = int(len(X_fit) / batch_size)
    residual = len(X_fit) % batch_size
    if num_of_iter * batch_size < len(X_fit):
      num_of_iter += 1
      
    print("split:",index,"num_of_iter:",num_of_iter,"residual:",residual)
    
    for i in range(num_of_iter):
      if i != num_of_iter - 1:
        subset = slice(batch_size * i, batch_size * (i + 1))
      else:
        subset = slice(batch_size * i, len(X_fit))
      X_sub = X_fit[subset]
      y_sub = y_fit[subset]
      if i % batch_size == 0:
        print("Finished %s iterations." % i)
      clf.partial_fit(X_sub, y_sub, classes=np.unique(y_target_orig))
      
    y_pred = clf.predict(X_val)
    print('\nSGD (Raw) AUC Score: {}'.format(roc_auc_score(y_val, y_pred)),'Fold:',index)

    del X_fit, X_val, y_fit, y_val
    gc.collect()
    index += 1

split: 1 num_of_iter: 7435 residual: 568
Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.
Finished 3000 iterations.
Finished 4000 iterations.
Finished 5000 iterations.
Finished 6000 iterations.
Finished 7000 iterations.

SGD (Raw) AUC Score: 0.4921278054992499 Fold: 1
split: 2 num_of_iter: 7435 residual: 569
Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.
Finished 3000 iterations.
Finished 4000 iterations.
Finished 5000 iterations.
Finished 6000 iterations.
Finished 7000 iterations.

SGD (Raw) AUC Score: 0.4955276040800727 Fold: 2
split: 3 num_of_iter: 7435 residual: 569
Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.
Finished 3000 iterations.
Finished 4000 iterations.
Finished 5000 iterations.
Finished 6000 iterations.
Finished 7000 iterations.

SGD (Raw) AUC Score: 0.5004599929642581 Fold: 3
split: 4 num_of_iter: 7435 residual: 569
Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.

Looks like with 6-fold CV our AUC stays around 0.5 on the folds that completed, i.e. no better than random guessing. Still... more like a baseline only :(
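For comparison, here is a minimal sketch of a cross-validation loop in which every fold gets a fresh classifier and the AUC is computed from decision scores. It runs on synthetic data (an assumption for illustration), so the numbers it prints are not comparable to the results above.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(42)
# Synthetic toy data with a linear decision boundary.
X = rng.randn(1200, 4)
y = (X[:, 0] - X[:, 2] > 0).astype(int)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_aucs = []
for train_idx, val_idx in skf.split(X, y):
    # A new model per fold, so no fold is validated on rows it was trained on.
    clf = SGDClassifier(random_state=0)
    batch_size = 200
    for start in range(0, len(train_idx), batch_size):
        rows = train_idx[start:start + batch_size]
        clf.partial_fit(X[rows], y[rows], classes=np.array([0, 1]))
    # Score the ranking with continuous decision values, not hard labels.
    fold_aucs.append(roc_auc_score(y[val_idx], clf.decision_function(X[val_idx])))

print(fold_aucs)
```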

## Fine tuning for SGD model

Referring to the [guide](https://scikit-learn.org/stable/modules/sgd.html#tips-on-practical-use) here for practical tuning.

In [0]:
# Check how many iterations are indeed needed
max_iter = np.ceil(10**6 / len(X_train))
max_iter

1.0

So 1 iteration (a single pass over the data) should be enough. I would like to use exactly that, but doing so on the full training set at once will clearly blow up the RAM, and I don't want to start over again.

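The guide also suggests scaling the features, but I doubt that scaling makes sense for label-encoded categorical features, since the integer codes are arbitrary identifiers with no meaningful order or magnitude (citation needed).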
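One common alternative to scaling integer codes is one-hot encoding, which gives each category its own 0/1 column and keeps the result sparse. The sketch below is only an illustration on made-up codes, not something this notebook has run on the real data, where the number of distinct values per column would make the matrix very wide.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded categorical columns.
X_train_codes = np.array([[1, 3], [2, 3], [1, 4], [2, 4], [1, 3], [2, 4]])
y_train_toy = np.array([0, 1, 0, 1, 0, 1])
X_new_codes = np.array([[1, 4], [2, 3], [5, 3]])  # code 5 was never seen in training

# handle_unknown="ignore" encodes unseen categories as all zeros for that column.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_ohe = encoder.fit_transform(X_train_codes)  # SciPy sparse matrix
X_new_ohe = encoder.transform(X_new_codes)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X_train_ohe, y_train_toy, classes=np.array([0, 1]))

print(X_train_ohe.shape, X_new_ohe.toarray())
```

With this representation the linear model learns one weight per category instead of one weight per arbitrary integer code.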

But we can search for the best alpha to see if it helps. (Note: alpha controls the strength of the regularization.)

In [0]:
# GridSearchCV cannot drive partial_fit, so the search is done by hand below
from sklearn.model_selection import GridSearchCV, cross_val_score

alphas = 10.0**-np.arange(1,7)
alphas

array([1.e-01, 1.e-02, 1.e-03, 1.e-04, 1.e-05, 1.e-06])

In [0]:
best_alpha = 0
best_accuracy = -float("inf")

In [0]:
for a in alphas:
  print("Testing with alpha =",a)
  clf = SGDClassifier(alpha = a)
  
  for i in range(num_of_iter):
    if i != num_of_iter - 1:
      subset = slice(batch_size * i, batch_size * (i + 1))
    else:
      subset = slice(batch_size * i, len(X_train))
    X_sub = X_train[subset]
    y_sub = y_train[subset]
    if i % 2000 == 0:
      print("Finished %s iterations." % i)
    clf.partial_fit(X_sub, y_sub, classes=np.unique(y_train))
    
  print("")
  print("Evaluating the overall performance now.")
  acc_score = clf.score(X_test, y_test)
  print("Accuracy of this model:",acc_score)
  if acc_score > best_accuracy:
    best_accuracy = acc_score
    best_alpha = a

Testing with alpha = 0.1
Finished 0 iterations.
Finished 2000 iterations.
Finished 4000 iterations.

Evaluating the overall performance now.
Accuracy of this model: 0.49762711058425524
Testing with alpha = 0.01
Finished 0 iterations.
Finished 2000 iterations.
Finished 4000 iterations.

Evaluating the overall performance now.
Accuracy of this model: 0.5053941965089382
Testing with alpha = 0.001
Finished 0 iterations.
Finished 2000 iterations.
Finished 4000 iterations.

Evaluating the overall performance now.
Accuracy of this model: 0.5013369156513557
Testing with alpha = 0.0001
Finished 0 iterations.
Finished 2000 iterations.
Finished 4000 iterations.

Evaluating the overall performance now.
Accuracy of this model: 0.5058520629464452
Testing with alpha = 1e-05
Finished 0 iterations.
Finished 2000 iterations.
Finished 4000 iterations.

Evaluating the overall performance now.
Accuracy of this model: 0.5188931044906915
Testing with alpha = 1e-06
Finished 0 iterations.
Finished 2000 iterati

Looks like alpha = 1e-05 gives the best accuracy, 0.5188931044906915. (But this is still not good enough: it is barely above chance :()

Another thing to consider is feature correlations. In EDA 1 we've shown that there are some features that are correlated. What if we drop them? Will it boost the performance?
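
One possible way to try this is sketched below on toy data. It is only an illustration under assumptions: the helper `drop_correlated`, the 0.9 threshold, and the column names are all made up here, and Pearson correlation on label-encoded categorical columns should be read with caution.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep a column only if |corr| with every kept column is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Toy data: "a_copy" is almost a rescaled copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b])

X_kept, kept_names = drop_correlated(X, ["a", "a_copy", "b"])
print(kept_names)
```

The greedy order matters: the first column of a correlated pair is the one that survives.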

TODO: https://sklearn.org/modules/scaling_strategies.html

If using [Perceptron](https://sklearn.org/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron), the linked documentation notes:

> Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).

The code below fits SGD using only the columns we selected in EDA1, which should have low correlation with each other. Again, the accuracy is only around 50%.

In [0]:
# keep only the columns that we want (same list for train and test)
keep_cols = [
    'ProductName', 'EngineVersion', 'AppVersion', 'AvSigVersion', 'IsBeta',
    'RtpStateBitfield', 'AVProductStatesIdentifier', 'AVProductsInstalled',
    'AVProductsEnabled', 'HasTpm', 'CountryIdentifier', 'CityIdentifier',
    'OrganizationIdentifier', 'GeoNameIdentifier',
    'LocaleEnglishNameIdentifier', 'Processor', 'OsVer', 'OsBuild', 'OsSuite',
    'OsPlatformSubRelease', 'OsBuildLab', 'IsProtected', 'AutoSampleOptIn',
    'SMode', 'IeVerIdentifier', 'SmartScreen', 'Firewall', 'UacLuaenable',
    'Census_MDC2FormFactor', 'Census_DeviceFamily', 'Census_OEMNameIdentifier',
    'Census_OEMModelIdentifier', 'Census_ProcessorCoreCount',
    'Census_ProcessorModelIdentifier', 'Census_PrimaryDiskTotalCapacity',
    'Census_PrimaryDiskTypeName', 'Census_SystemVolumeTotalCapacity',
    'Census_HasOpticalDiskDrive', 'Census_TotalPhysicalRAM',
    'Census_ChassisTypeName',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches',
    'Census_InternalPrimaryDisplayResolutionHorizontal',
    'Census_PowerPlatformRoleName', 'Census_InternalBatteryNumberOfCharges',
    'Census_OSVersion', 'Census_OSBranch', 'Census_OSBuildRevision',
    'Census_OSEdition', 'Census_OSInstallTypeName',
    'Census_OSInstallLanguageIdentifier', 'Census_OSWUAutoUpdateOptionsName',
    'Census_IsPortableOperatingSystem', 'Census_GenuineStateName',
    'Census_ActivationChannel', 'Census_IsFlightsDisabled', 'Census_FlightRing',
    'Census_FirmwareManufacturerIdentifier',
    'Census_FirmwareVersionIdentifier', 'Census_IsSecureBootEnabled',
    'Census_IsVirtualDevice', 'Census_IsTouchEnabled', 'Census_IsPenCapable',
    'Census_IsAlwaysOnAlwaysConnectedCapable', 'Wdft_IsGamer',
    'Wdft_RegionIdentifier',
]
train_cleaned = train[keep_cols]
test_cleaned = test[keep_cols]

In [0]:
train_cleaned.shape

(8921483, 65)

In [0]:
y_target = np.array(train['HasDetections'])
train_ids = train_cleaned.index
test_ids  = test_cleaned.index

del train['HasDetections'], train['MachineIdentifier'], test['MachineIdentifier']
gc.collect()

7

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_cleaned, y_target, test_size=0.33, random_state=42)

In [0]:
X_train.shape

(5977393, 65)

Trying out SGD again...

In [0]:
# Trying SGD Classifier below
from sklearn.linear_model import SGDClassifier
import random

clf = SGDClassifier()

batch_size = 1000
residual = 0
num_of_iter = int(len(X_train) / batch_size)
residual = len(X_train) % batch_size
if num_of_iter * batch_size < len(X_train):
  num_of_iter += 1

In [0]:
print("num_of_iter:",num_of_iter,"residual:",residual)

num_of_iter: 5978 residual: 393
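
The batch bookkeeping above (full batches plus one short final batch) can also be written as a small generator. This is only a sketch: `batch_slices` is a hypothetical helper, not something defined in the notebook. The numbers fed to it are the training-set size and batch size from the cells above.

```python
def batch_slices(n_rows, batch_size):
    """Yield consecutive slices covering range(n_rows); the last one may be short."""
    for start in range(0, n_rows, batch_size):
        yield slice(start, min(start + batch_size, n_rows))

slices = list(batch_slices(5977393, 1000))
n_batches = len(slices)                              # 5978
last_batch_rows = slices[-1].stop - slices[-1].start  # 393
print("num_of_iter:", n_batches, "residual:", last_batch_rows)
```

With this helper, a training loop would reduce to `for subset in batch_slices(len(X_train), batch_size): ...`, with no special case for the last batch.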


In [0]:
for i in range(num_of_iter):
  if i != num_of_iter - 1:
    subset = slice(batch_size * i, batch_size * (i + 1))
  else:
    # last batch picks up the residual rows
    subset = slice(batch_size * i, len(X_train))
  X_sub = X_train[subset]
  y_sub = y_train[subset]
  # progress message every 1000 batches
  if i % 1000 == 0:
    print("Finished %s iterations." % i)
  clf.partial_fit(X_sub, y_sub, classes=np.unique(y_train))

Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.
Finished 3000 iterations.
Finished 4000 iterations.
Finished 5000 iterations.


In [0]:
# predict on unseen data
# a small test below
y_pred = clf.predict(X_test[:10])
y_pred

array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1], dtype=int8)

In [0]:
y_test[:10]

array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1], dtype=int8)
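
As a quick sanity check, the two ten-element arrays above can be tallied by hand into a confusion matrix. The sketch below uses only the standard library and the values printed in the two cells above.

```python
from collections import Counter

y_true = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]   # y_test[:10] from the cell above
y_pred = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]   # clf.predict(X_test[:10])

counts = Counter(zip(y_true, y_pred))
# rows = true class (0, 1), columns = predicted class (0, 1)
matrix = [[counts[(t, p)] for p in (0, 1)] for t in (0, 1)]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(matrix)    # [[4, 0], [3, 3]]
print(accuracy)  # 0.7
```

On this tiny preview all four negatives are right and half of the six positives are missed. Ten rows say nothing about the full test set, though.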

In [0]:
# Getting an accuracy measurement
clf.score(X_test, y_test)

0.4949661864956574

In [0]:
# Classification report (per-class precision, recall and F1)
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.50      0.58      0.53   1473693
           1       0.49      0.41      0.45   1470397

   micro avg       0.49      0.49      0.49   2944090
   macro avg       0.49      0.49      0.49   2944090
weighted avg       0.49      0.49      0.49   2944090



Looks like using fewer columns doesn't help here. Trying out a sparse (one-hot encoded) matrix below.

## Sparse matrix SGD Classifier

In [0]:
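from scipy.sparse import vstack, save_npz, load_npz, csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Fit OneHotEncoder on the full training frame
# (sparse=True is the argument name in the scikit-learn release used here; 1.2+ calls it sparse_output)
ohe = OneHotEncoder(categories='auto', sparse=True, dtype='uint8').fit(train)

# Transform the data in chunks of m rows to reduce memory usage
m = 100000
train2 = vstack([ohe.transform(train[i*m:(i+1)*m]) for i in range(train.shape[0] // m + 1)])
test  = vstack([ohe.transform(test[i*m:(i+1)*m])  for i in range(test.shape[0] // m +  1)])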
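
The chunked encoding idea can be checked on a toy array, as sketched below. The category values are made up, the chunk size is shrunk to 3, and `math.ceil` is swapped in for `// m + 1` so that no empty trailing chunk is produced when the row count is a multiple of `m`.

```python
import math
import numpy as np
from scipy.sparse import vstack
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real frame: 7 rows, 2 categorical columns (hypothetical values)
X = np.array([["win8", "x64"], ["win10", "x64"], ["win10", "x86"],
              ["win7", "x64"], ["win10", "arm"], ["win8", "x86"],
              ["win10", "x64"]], dtype=object)

# The encoder returns a sparse matrix by default, so no sparse/sparse_output flag is passed
ohe = OneHotEncoder(categories="auto", dtype=np.uint8).fit(X)

m = 3  # chunk size (100000 in the notebook)
n_chunks = math.ceil(X.shape[0] / m)
chunks = [ohe.transform(X[i * m:(i + 1) * m]) for i in range(n_chunks)]
X_ohe = vstack(chunks).tocsr()

print(X_ohe.shape, X_ohe.nnz)
```

Each row ends up with exactly one non-zero per input column, and the stacked result matches a single `ohe.transform(X)` call on the whole array.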

In [0]:
save_npz('/content/gdrive/My Drive/Coding experiment/MARTHA/data/train.npz', train2, compressed=True)
save_npz('/content/gdrive/My Drive/Coding experiment/MARTHA/data/test.npz',  test,  compressed=True)

del ohe, train, train2, test
gc.collect()

1148

In [0]:
train = load_npz('/content/gdrive/My Drive/Coding experiment/MARTHA/data/train.npz')

In [0]:
# take a look at the sparse matrix
train

<8921483x7861 sparse matrix of type '<class 'numpy.uint8'>'
	with 722640123 stored elements in Compressed Sparse Row format>
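
A little arithmetic on the numbers printed above shows how sparse this matrix is. Only the figures from the repr are used.

```python
n_rows, n_cols = 8921483, 7861
nnz = 722640123

per_row = nnz / n_rows              # stored elements per row
density = nnz / (n_rows * n_cols)   # fraction of cells that are stored

print(per_row)            # 81.0
print(round(density, 4))  # 0.0103
```

Every row holds exactly 81 stored elements, which would be consistent with one active indicator per encoded input column. Only about 1% of the cells are stored, which is why the sparse format is worth the trouble here.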

In [0]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_1 = np.zeros(train_ids.shape[0])  # out-of-fold decision scores
m = 100000
batch_size = 1000

print('\nSGD(Sparse)\n')

for counter, (train_index, test_index) in enumerate(skf.split(train_ids, y_target_orig)):

    print('Fold {}\n'.format(counter + 1))

    train = load_npz('/content/gdrive/My Drive/Coding experiment/MARTHA/data/train.npz')
    X_fit = vstack([train[train_index[i*m:(i+1)*m]] for i in range(train_index.shape[0] // m + 1)])
    X_val = vstack([train[test_index[i*m:(i+1)*m]]  for i in range(test_index.shape[0] //  m + 1)])
    X_fit, X_val = csr_matrix(X_fit, dtype='float32'), csr_matrix(X_val, dtype='float32')
    y_fit, y_val = y_target_orig[train_index], y_target_orig[test_index]

    # len() is ambiguous for sparse matrices, so use shape[0]
    n_fit = X_fit.shape[0]
    num_of_iter, residual = divmod(n_fit, batch_size)
    if residual:
        num_of_iter += 1

    print("split:",counter,"num_of_iter:",num_of_iter,"residual:",residual)

    # fresh model per fold, so no fold is scored by weights already trained on its validation rows
    clf = SGDClassifier()
    for i in range(num_of_iter):
        subset = slice(batch_size * i, min(batch_size * (i + 1), n_fit))
        if i % 1000 == 0:
            print("Finished %s iterations." % i)
        clf.partial_fit(X_fit[subset], y_fit[subset], classes=np.unique(y_target_orig))

    # score with the signed margin rather than hard 0/1 labels so AUC reflects the ranking
    oof_1[test_index] = clf.decision_function(X_val)
    print('\nSGD (Sparse) AUC Score: {}'.format(roc_auc_score(y_val, oof_1[test_index])),'Fold:',counter)

    del X_fit, X_val, y_fit, y_val
    gc.collect()


SGD(Sparse)

Fold 1



In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_ids, y_target, test_size=0.33, random_state=42)

In [0]:
# Trying SGD Classifier below
from sklearn.linear_model import SGDClassifier
import random

clf = SGDClassifier(alpha=1e-5)

batch_size = 1000
residual = 0
num_of_iter = int(len(X_train) / batch_size)
residual = len(X_train) % batch_size
if num_of_iter * batch_size < len(X_train):
  num_of_iter += 1

In [0]:
print("num_of_iter:",num_of_iter,"residual:",residual)

num_of_iter: 5978 residual: 393


In [0]:
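for i in range(num_of_iter):
  if i != num_of_iter - 1:
    subset = slice(batch_size * i, batch_size * (i + 1))
  else:
    subset = slice(batch_size * i, len(X_train))
  # X_train holds the shuffled row ids from train_test_split, so index the
  # sparse matrix with those ids to keep the rows aligned with y_train
  # (train is the matrix reloaded with load_npz above; train2 was deleted)
  rows = np.asarray(X_train[subset])
  X_sub = csr_matrix(train[rows], dtype='float32')
  y_sub = y_train[subset]
  if i % 1000 == 0:
    print("Finished %s iterations." % i)
  clf.partial_fit(X_sub, y_sub, classes=np.unique(y_train))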

Finished 0 iterations.
Finished 1000 iterations.
Finished 2000 iterations.
Finished 3000 iterations.
Finished 4000 iterations.
Finished 5000 iterations.


In [0]:
# Persist model
from joblib import dump, load

dump(clf, '/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGD_sparse.joblib')

['/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGD_sparse.joblib']

In [0]:
# Load the saved model
from joblib import dump, load
clf = load('/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGD_sparse.joblib')
clf

SGDClassifier(alpha=1e-05, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

Looks like building the test matrix with `vstack` causes a timeout on Google Colab, so the cells below could not be run.
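
A possible workaround, sketched here on a toy matrix, is to skip `vstack` and index the CSR matrix with an integer array in one call. The matrix and row ids below are made up, and whether this fits within Colab's limits for the full test split is untested.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

# Toy 6x4 matrix standing in for the one-hot encoded training data
dense = np.arange(24).reshape(6, 4) % 3
train_toy = csr_matrix(dense, dtype="uint8")
test_rows = np.array([4, 0, 5])  # hypothetical held-out row ids

# Row-by-row stacking: one Python-level operation per row
slow = vstack([train_toy[r] for r in test_rows])

# Fancy indexing: a single vectorised call that returns the same rows
fast = train_toy[test_rows]

print((slow != fast).nnz == 0)
```

Both variants return the same rows in the same order. The second avoids creating one small matrix per row.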

In [0]:
vstack(train[train_ids[X_test]])

In [0]:
# Getting an accuracy measurement
X_test_sparse = csr_matrix(vstack(train[train_ids[X_test]]), dtype='float32')

In [0]:
clf.score(X_test_sparse, y_test)

In [0]:
# Classification report (per-class precision, recall and F1)
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test_sparse)
print(classification_report(y_test, y_pred))

In [0]:
# ROC?
print('SGD (Sparse) Testing set AUC Score: {}'.format(roc_auc_score(y_test, y_pred)))
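
One caveat on the AUC cells: hard 0/1 predictions throw away the ranking information that AUC measures. The sketch below uses a hand-written pairwise AUC and made-up numbers to show the difference. `auc` is a hypothetical helper, not a scikit-learn function.

```python
def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
margins = [-2.0, -1.0, -0.5, 1.5]               # the kind of value decision_function returns
labels = [1 if m > 0 else 0 for m in margins]   # the kind of value predict returns

auc_margins = auc(y_true, margins)
auc_labels = auc(y_true, labels)
print(auc_margins)  # 1.0
print(auc_labels)   # 0.75
```

The margins rank every positive above every negative. After thresholding, one positive ties with both negatives and the score drops. Passing `clf.decision_function(...)` to `roc_auc_score` instead of `clf.predict(...)` would avoid this loss.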

## Generating predictions

Using the compact model. Note that the features must be converted to a numerical encoding first.

In [8]:
del train
gc.collect()

574

In [0]:
submission = pd.read_csv('/content/gdrive/My Drive/Coding experiment/MARTHA/data/sample_submission.csv')
svm_test_result  = np.zeros(submission['MachineIdentifier'].shape[0])

In [10]:
from joblib import dump, load
model_path = '/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGDClassifier.joblib'
clf = load(model_path)
clf

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [0]:
batch_size = 500
residual = 0
num_of_iter = int(len(test) / batch_size)
residual = len(test) % batch_size
if num_of_iter * batch_size < len(test):
  num_of_iter += 1

In [13]:
print("num_of_iter:",num_of_iter,"residual:",residual)

num_of_iter: 15707 residual: 253


In [14]:
print("\nSVM - Prediction\n")
for i in range(num_of_iter):
  if i != num_of_iter - 1:
    subset = slice(batch_size * i, batch_size * (i + 1))
  else:
    # last batch runs to the end of the test set (not X_train)
    subset = slice(batch_size * i, len(test))
  X_sub = test[subset]
  if i % batch_size == 0:
    print("Finished %s iterations." % i)
  # predict on the current batch only and keep the positive-class column
  svm_test_result[subset] = clf.predict_proba(X_sub)[:, 1]
  
del clf
gc.collect()


SVM - Prediction

Finished 0 iterations.


AttributeError: ignored

Looks like the SGD model (using hinge loss) is not able to generate prediction probabilities: `SGDClassifier.predict_proba` is only available for the logistic and `modified_huber` losses. So no test result is available.
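
Two possible ways around this are sketched below on synthetic data (nothing here comes from the competition files). One keeps the hinge loss and uses the signed margin from `decision_function` as the score, which is enough if the submission is judged by a ranking metric such as AUC. The other trains with `loss="modified_huber"`, which does expose `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic binary problem, purely for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Option 1: keep hinge loss and use the signed margin as the score
hinge = SGDClassifier(loss="hinge", random_state=0)
hinge.partial_fit(X, y, classes=np.array([0, 1]))
scores = hinge.decision_function(X)

# Option 2: a loss for which predict_proba is defined
huber = SGDClassifier(loss="modified_huber", random_state=0)
huber.partial_fit(X, y, classes=np.array([0, 1]))
proba = huber.predict_proba(X)[:, 1]

print(hasattr(hinge, "predict_proba"), scores.shape, proba.min(), proba.max())
```

Option 2 means retraining, since the saved model was fitted with hinge loss. Option 1 works with the model already on disk.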