# Microsoft Malware Challenge

## EDA 2 & Baseline model

## For CSCE 633 Machine Learning, Spring 2019, Course project
### Team: MARTHA
### Author: Rose Lin

This is the second notebook in the series. The *EDA1* notebook took a quick pass over all the variables. Here we look at them in closer detail and address some of the issues raised in that first notebook.

~Ongoing efforts, your feedback is appreciated!~

### RECAP

In EDA 1, we concluded that the following 66 features (the count includes the target, `HasDetections`) can be kept for our model:

```
['ProductName' 'EngineVersion' 'AppVersion' 'AvSigVersion' 'IsBeta'
 'RtpStateBitfield' 'AVProductStatesIdentifier' 'AVProductsInstalled'
 'AVProductsEnabled' 'HasTpm' 'CountryIdentifier' 'CityIdentifier'
 'OrganizationIdentifier' 'GeoNameIdentifier'
 'LocaleEnglishNameIdentifier' 'Processor' 'OsVer' 'OsBuild' 'OsSuite'
 'OsPlatformSubRelease' 'OsBuildLab' 'IsProtected' 'AutoSampleOptIn'
 'SMode' 'IeVerIdentifier' 'SmartScreen' 'Firewall' 'UacLuaenable'
 'Census_MDC2FormFactor' 'Census_DeviceFamily' 'Census_OEMNameIdentifier'
 'Census_OEMModelIdentifier' 'Census_ProcessorCoreCount'
 'Census_ProcessorModelIdentifier' 'Census_PrimaryDiskTotalCapacity'
 'Census_PrimaryDiskTypeName' 'Census_SystemVolumeTotalCapacity'
 'Census_HasOpticalDiskDrive' 'Census_TotalPhysicalRAM'
 'Census_ChassisTypeName'
 'Census_InternalPrimaryDiagonalDisplaySizeInInches'
 'Census_InternalPrimaryDisplayResolutionHorizontal'
 'Census_PowerPlatformRoleName' 'Census_InternalBatteryNumberOfCharges'
 'Census_OSVersion' 'Census_OSBranch' 'Census_OSBuildRevision'
 'Census_OSEdition' 'Census_OSInstallTypeName'
 'Census_OSInstallLanguageIdentifier' 'Census_OSWUAutoUpdateOptionsName'
 'Census_IsPortableOperatingSystem' 'Census_GenuineStateName'
 'Census_ActivationChannel' 'Census_IsFlightsDisabled' 'Census_FlightRing'
 'Census_FirmwareManufacturerIdentifier'
 'Census_FirmwareVersionIdentifier' 'Census_IsSecureBootEnabled'
 'Census_IsVirtualDevice' 'Census_IsTouchEnabled' 'Census_IsPenCapable'
 'Census_IsAlwaysOnAlwaysConnectedCapable' 'Wdft_IsGamer'
 'Wdft_RegionIdentifier' 'HasDetections']
```

But is this correct? 

We also need to consider the following issues:

1. NaNs - how should we handle missing values? (Needs a thorough investigation.)
2. Skewed variables - how do we fix their skewness?
3. More feature selection/combination - can we map features into another space instead of dropping them?
4. Algorithm improvement - the **real** bread and butter.

This notebook attempts to address the first three, and builds a baseline SVM model to check its performance.

In [1]:
# load the data from Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
import pandas as pd
import numpy as np
import lightgbm as lgb
#import xgboost as xgb
from scipy.sparse import vstack, csr_matrix, save_npz, load_npz
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
#from sklearn.metrics import roc_auc_score
import gc
gc.enable()

In [0]:
pd.set_option('display.max_columns', 100)

In [0]:
# from https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int16',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float64', # was 'float32'
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float32', # was 'float16'
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float32', # was 'float16'
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float64', # was 'float32'
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float64', # was 'float32'
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32', # was 'float16'
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32', # was 'float16'
        'Census_InternalPrimaryDisplayResolutionVertical':      'float32', # was 'float16'
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float64', # was 'float32'
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }

In [6]:
train = pd.read_csv('/content/gdrive/My Drive/Coding experiment/MARTHA/data/train.csv', dtype=dtypes, low_memory=True)
train['MachineIdentifier'] = train.index.astype('uint32')
test  = pd.read_csv('/content/gdrive/My Drive/Coding experiment/MARTHA/data/test.csv',  dtype=dtypes, low_memory=True)
test['MachineIdentifier']  = test.index.astype('uint32')
gc.collect()

201054

In [0]:
# take a look at the head
train.head()

Unnamed: 0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_ProcessorClass,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
0,0,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1735.0,0,7.0,0,,53447.0,1.0,1.0,1,29,128035.0,18.0,35.0,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,Desktop,Windows.Desktop,2668.0,9124.0,4.0,5.0,2341.0,,476940.0,HDD,299451.0,0,4096.0,Desktop,18.9,1440.0,900.0,Desktop,,4294967000.0,10.0.17134.165,amd64,rs4_release,17134,165,Professional,PROFESSIONAL,UUPUpgrade,26.0,119,UNKNOWN,0,IS_GENUINE,Retail,,0.0,Retail,,628.0,36144.0,0,,0.0,0,0,0.0,0.0,10.0,0
1,1,win8defender,1.1.14600.4,4.13.17134.1,1.263.48.0,0,7.0,0,,53447.0,1.0,1.0,1,93,1482.0,18.0,119.0,64,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,Notebook,Windows.Desktop,2668.0,91656.0,4.0,5.0,2405.0,,476940.0,HDD,102385.0,0,4096.0,Notebook,13.9,1366.0,768.0,Mobile,,1.0,10.0.17134.1,amd64,rs4_release,17134,1,Professional,PROFESSIONAL,IBSClean,8.0,31,UNKNOWN,0,OFFLINE,Retail,,0.0,NOT_SET,,628.0,57858.0,0,,0.0,0,0,0.0,0.0,8.0,0
2,2,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1341.0,0,7.0,0,,53447.0,1.0,1.0,1,86,153579.0,18.0,64.0,49,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,Desktop,Windows.Desktop,4909.0,317701.0,4.0,5.0,1972.0,,114473.0,SSD,113907.0,0,4096.0,Desktop,21.5,1920.0,1080.0,Desktop,,4294967000.0,10.0.17134.165,amd64,rs4_release,17134,165,Core,CORE,UUPUpgrade,7.0,30,FullAuto,0,IS_GENUINE,OEM:NONSLP,,0.0,Retail,,142.0,52682.0,0,,0.0,0,0,0.0,0.0,3.0,0
3,3,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1527.0,0,7.0,0,,53447.0,1.0,1.0,1,88,20710.0,,117.0,115,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,Desktop,Windows.Desktop,1443.0,275890.0,4.0,5.0,2273.0,,238475.0,UNKNOWN,227116.0,0,4096.0,MiniTower,18.5,1366.0,768.0,Desktop,,4294967000.0,10.0.17134.228,amd64,rs4_release,17134,228,Professional,PROFESSIONAL,UUPUpgrade,17.0,64,FullAuto,0,IS_GENUINE,OEM:NONSLP,,0.0,Retail,,355.0,20050.0,0,,0.0,0,0,0.0,0.0,3.0,1
4,4,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1379.0,0,7.0,0,,53447.0,1.0,1.0,1,18,37376.0,,277.0,75,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,Notebook,Windows.Desktop,1443.0,331929.0,4.0,5.0,2500.0,,476940.0,HDD,101900.0,0,6144.0,Portable,14.0,1366.0,768.0,Mobile,lion,0.0,10.0.17134.191,amd64,rs4_release,17134,191,Core,CORE,Update,8.0,31,FullAuto,0,IS_GENUINE,Retail,0.0,0.0,Retail,0.0,355.0,19844.0,0,0.0,0.0,0,0,0.0,0.0,1.0,1


In [0]:
train.shape

(8921483, 83)

In [0]:
test.shape

(7853253, 82)

### Deal with missing values

For the `scikit-learn` implementation of SVM to work, we have to fill in all the missing values.

Note: with LightGBM or XGBoost, explicit feature selection matters less, because tree-based boosting effectively selects features through its splits. Both libraries also tolerate NaNs natively.

In [0]:
# Nan Values
null_counts = train.isnull().sum()/train.shape[0]
#print(null_counts)
print("Columns with at least 1 NA:")
for index in null_counts[null_counts > 0].index:
  print("name:", index, "types:", train[index].dtypes, '% null:', null_counts[index])

Columns with at least 1 NA:
name: RtpStateBitfield types: float16 % null: 0.003622491910817966
name: DefaultBrowsersIdentifier types: float32 % null: 0.9514163732644001
name: AVProductStatesIdentifier types: float32 % null: 0.004059975230575455
name: AVProductsInstalled types: float16 % null: 0.004059975230575455
name: AVProductsEnabled types: float16 % null: 0.004059975230575455
name: CityIdentifier types: float32 % null: 0.0364747654621995
name: OrganizationIdentifier types: float16 % null: 0.3084148677972037
name: GeoNameIdentifier types: float16 % null: 2.3874954421815298e-05
name: OsBuildLab types: category % null: 2.3538687458127756e-06
name: IsProtected types: float16 % null: 0.00404013547971789
name: PuaMode types: category % null: 0.9997411865269485
name: SMode types: float16 % null: 0.0602768620418825
name: IeVerIdentifier types: float16 % null: 0.006601368853137982
name: SmartScreen types: category % null: 0.35610794752397107
name: Firewall types: float16 % null: 0.010239329

So we have 44 features that contain NaN values. I tried dropping those features, but that leaves us with too few features. Dropping every row that contains at least one NaN leaves fewer than half of the records. Neither option is ideal.
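To make the trade-off concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only borrow column names from the dataset. It compares how much survives when we drop columns versus rows that contain NaNs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame (hypothetical values, not the real data).
toy = pd.DataFrame({
    "SmartScreen": ["RequireAdmin", None, "ExistsNotSet", None, "RequireAdmin", None],
    "OrganizationIdentifier": [18.0, 18.0, np.nan, np.nan, 27.0, 18.0],
    "Firewall": [1.0, 1.0, 1.0, np.nan, 1.0, 1.0],
    "HasTpm": [1, 1, 1, 1, 0, 1],
})

# Fraction of missing values per column (same as isnull().sum() / n_rows).
null_share = toy.isnull().mean()

# Option 1: drop every column that has at least one NaN.
cols_kept = toy.dropna(axis=1).shape[1]

# Option 2: drop every row that has at least one NaN.
rows_kept = toy.dropna(axis=0).shape[0]

print(null_share)
print("columns kept:", cols_kept, "of", toy.shape[1])
print("rows kept:", rows_kept, "of", toy.shape[0])
```

On the toy frame both options discard most of the data, which mirrors the problem described above.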

Inspired by [this notebook](https://www.kaggle.com/bogorodvo/lightgbm-baseline-model-using-sparse-matrix), I think we should try encoding every feature as categorical, with missing values as their own "NA" category, and then fit the SVM. This preserves more of the data than dropping rows or columns.
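The idea can be sketched on a single toy column. This is an illustration under assumptions, not the notebook's exact code: the helper `to_labels` and the explicit `"NA"` label are hypothetical (the actual cell casts columns with `astype('str')` instead), and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with missing entries in both splits.
train_col = pd.Series(["RequireAdmin", np.nan, "Off", "RequireAdmin"])
test_col = pd.Series(["Off", "Warn", np.nan])

def to_labels(col):
    """Turn each value into a string, mapping missing values to 'NA'."""
    return ["NA" if pd.isna(v) else str(v) for v in col]

train_str = to_labels(train_col)
test_str = to_labels(test_col)

# Fit on the union of train and test values so both share one code book.
le = LabelEncoder().fit(train_str + test_str)

# Shift by one so that code 0 stays free for values dropped later.
train_codes = le.transform(train_str) + 1
test_codes = le.transform(test_str) + 1

print([str(c) for c in le.classes_])
print(train_codes.tolist())
print(test_codes.tolist())
```

Because "NA" is an ordinary class in the encoder, no row has to be dropped or imputed.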

In [7]:
print('Transform all features to category.\n')
count = 0
for usecol in train.columns.tolist()[1:-1]:

    if count % 5 == 0:
        print("Processed",count,"features.")
      
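    # Cast to string so that missing values become the string 'nan',
    # which then acts as the "NA" category
    train[usecol] = train[usecol].astype('str')
    test[usecol] = test[usecol].astype('str')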
    
    #Fit LabelEncoder
    le = LabelEncoder().fit(
            np.unique(train[usecol].unique().tolist()+
                      test[usecol].unique().tolist()))

    #At the end 0 will be used for dropped values
    train[usecol] = le.transform(train[usecol])+1
    test[usecol]  = le.transform(test[usecol])+1

    agg_tr = (train
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Train'}, axis=1))
    agg_te = (test
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Test'}, axis=1))

    agg = pd.merge(agg_tr, agg_te, on=usecol, how='outer').replace(np.nan, 0)
    #Select values with more than 1000 observations
    agg = agg[(agg['Train'] > 1000)].reset_index(drop=True)
    agg['Total'] = agg['Train'] + agg['Test']
    #Drop unbalanced values
    agg = agg[(agg['Train'] / agg['Total'] > 0.2) & (agg['Train'] / agg['Total'] < 0.8)]
    agg[usecol+'Copy'] = agg[usecol]

    train[usecol] = (pd.merge(train[[usecol]], 
                              agg[[usecol, usecol+'Copy']], 
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))

    test[usecol]  = (pd.merge(test[[usecol]], 
                              agg[[usecol, usecol+'Copy']], 
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))

    del le, agg_tr, agg_te, agg, usecol
    gc.collect()
    count += 1

Transform all features to category.

Processed 0 features.
Processed 5 features.
Processed 10 features.
Processed 15 features.
Processed 20 features.
Processed 25 features.
Processed 30 features.
Processed 35 features.
Processed 40 features.
Processed 45 features.
Processed 50 features.
Processed 55 features.
Processed 60 features.
Processed 65 features.
Processed 70 features.
Processed 75 features.
Processed 80 features.
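The filtering step inside the loop keeps an encoded value only if it appears more than 1000 times in train and its train share of all occurrences lies strictly between 0.2 and 0.8. Everything else is mapped to 0. A minimal sketch of that rule, using hypothetical counts and a hypothetical helper name `surviving_values`:

```python
import pandas as pd

def surviving_values(train_counts, test_counts, min_train=1000, low=0.2, high=0.8):
    """Return the encoded values that pass the frequency and balance filter."""
    # Outer-join the counts; a value absent from one split gets a count of 0.
    agg = pd.DataFrame({"Train": train_counts, "Test": test_counts}).fillna(0)
    agg["Total"] = agg["Train"] + agg["Test"]
    share = agg["Train"] / agg["Total"]
    kept = agg[(agg["Train"] > min_train) & (share > low) & (share < high)]
    return sorted(kept.index.tolist())

# Hypothetical counts per encoded value (index = encoded value).
train_counts = pd.Series({1: 5000, 2: 800, 3: 9000, 4: 1200})
test_counts = pd.Series({1: 4000, 2: 700, 3: 500, 5: 3000})

kept_values = surviving_values(train_counts, test_counts)
print(kept_values)
```

Here value 2 is too rare, values 3 and 4 are concentrated in train, and value 5 never occurs in train, so only value 1 keeps its own code.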


Take a look at the result: every column is now encoded as an integer category code.

In [0]:
train.sample(n=2)

Unnamed: 0,MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,Platform,Processor,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Census_MDC2FormFactor,Census_DeviceFamily,Census_OEMNameIdentifier,Census_OEMModelIdentifier,Census_ProcessorCoreCount,Census_ProcessorManufacturerIdentifier,Census_ProcessorModelIdentifier,Census_ProcessorClass,Census_PrimaryDiskTotalCapacity,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_HasOpticalDiskDrive,Census_TotalPhysicalRAM,Census_ChassisTypeName,Census_InternalPrimaryDiagonalDisplaySizeInInches,Census_InternalPrimaryDisplayResolutionHorizontal,Census_InternalPrimaryDisplayResolutionVertical,Census_PowerPlatformRoleName,Census_InternalBatteryType,Census_InternalBatteryNumberOfCharges,Census_OSVersion,Census_OSArchitecture,Census_OSBranch,Census_OSBuildNumber,Census_OSBuildRevision,Census_OSEdition,Census_OSSkuName,Census_OSInstallTypeName,Census_OSInstallLanguageIdentifier,Census_OSUILocaleIdentifier,Census_OSWUAutoUpdateOptionsName,Census_IsPortableOperatingSystem,Census_GenuineStateName,Census_ActivationChannel,Census_IsFlightingInternal,Census_IsFlightsDisabled,Census_FlightRing,Census_ThresholdOptIn,Census_FirmwareManufacturerIdentifier,Census_FirmwareVersionIdentifier,Census_IsSecureBootEnabled,Census_IsWIMBootEnabled,Census_IsVirtualDevice,Census_IsTouchEnabled,Census_IsPenCapable,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections
2038819,2038819,5,0,0,0,1,7,1,2610,26978,2,2,2,24,0,19,57,235,1,2,1,7,13,5,307,5,2,1,2,0,28,21,2,2,9,2,2071,0,45,6,1696,4,6049,2,0,1,4947,43,108,930,82,4,72,1,0,1,13,122,0,5,3,4,38,82,3,1,2,1,3,1,8,3,110,2195,1,3,1,1,1,1,3,16,0
8578774,8578774,5,0,0,0,1,7,1,2610,26978,2,2,2,186,0,19,284,268,1,2,1,7,13,5,307,5,2,1,2,0,28,21,2,2,3,2,1831,123326,31,6,1389,4,6049,1,383108,1,3147,29,186,930,82,2,72,29961,425,1,13,122,214,5,3,3,6,99,3,1,2,2,3,1,8,3,110,53346,1,3,1,1,1,1,2,11,1


In [9]:
y_target = np.array(train['HasDetections'])
train_ids = train.index
test_ids  = test.index

del train['HasDetections'], train['MachineIdentifier'], test['MachineIdentifier']
gc.collect()

168

In [0]:
y_target.shape

(8921483,)

In [0]:
train.shape

(8921483, 81)

For now we fit the SVM on a dense matrix. A sparse matrix is left for later.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train, y_target, test_size=0.33, random_state=42)

Unfortunately, `LinearSVC` never works on Colab, even though we have about 12 GB of RAM here. The scikit-learn implementation needs to see all of the data at once, and our dataset is very large. So instead of a sparse representation, I will try an online (incremental) fitting algorithm. (See the scikit-learn documentation [here](https://scikit-learn.org/0.15/modules/scaling_strategies.html))
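The online-fitting idea can be sketched on synthetic data. This is a minimal illustration, assuming random Gaussian features and a made-up linear labeling rule. It is not the notebook's data or a tuned model.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic data standing in for the real features (assumption).
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(np.int8)

# loss="hinge" (the default) gives a linear SVM trained by SGD.
clf = SGDClassifier(loss="hinge", random_state=0)

batch_size = 2000
classes = np.unique(y)  # required on the first partial_fit call
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # Each call updates the model using only this mini-batch.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

print(clf.score(X, y))
```

Only one mini-batch has to be in memory per `partial_fit` call, which is what makes this approach workable when the full dataset does not fit.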

In [0]:
# DON'T TRY TO RUN THIS! IT WON'T WORK!
from sklearn.svm import LinearSVC

clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X_train, y_train)

In [0]:
clf.score(X_test, y_test)

In [0]:
# Trying SGD Classifier below
from sklearn.linear_model import SGDClassifier

# The default loss is 'hinge', i.e. a linear SVM trained with SGD
clf = SGDClassifier()

batch_size = 2000
# Number of full batches, plus one extra batch for any leftover rows
num_of_iter = int(len(X_train) / batch_size)
residual = len(X_train) % batch_size
if num_of_iter * batch_size < len(X_train):
  num_of_iter += 1

In [12]:
print("num_of_iter:",num_of_iter,"residual:",residual)

num_of_iter: 2989 residual: 1393
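As a cross-check of these numbers, the same batch boundaries can be produced with ceiling division. This is a small sketch and `batch_slices` is a hypothetical helper name. The row count is the 8,921,483 training rows minus the 2,944,090 held-out rows that appear as the support in the classification report.

```python
import math

def batch_slices(n_rows, batch_size):
    """Yield consecutive slices that cover range(n_rows) in order."""
    n_batches = math.ceil(n_rows / batch_size)
    for i in range(n_batches):
        # The last slice is clipped to n_rows, so it may be shorter.
        yield slice(i * batch_size, min((i + 1) * batch_size, n_rows))

n_rows = 8921483 - 2944090
slices = list(batch_slices(n_rows, 2000))

print(len(slices))                          # number of mini-batches
print(slices[-1].stop - slices[-1].start)   # size of the last batch
```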


In [13]:
for i in range(num_of_iter):
  if i != num_of_iter - 1:
    subset = slice(batch_size * i, batch_size * (i + 1))
  else:
    subset = slice(batch_size * i, len(X_train))
  if i % 100 == 0:
    print("Running %s-th iteration." % i)
  # .iloc makes the positional row slice explicit
  clf.partial_fit(X_train.iloc[subset], y_train[subset], classes=np.unique(y_train))

Running 0-th iteration.
Running 100-th iteration.
Running 200-th iteration.
Running 300-th iteration.
Running 400-th iteration.
Running 500-th iteration.
Running 600-th iteration.
Running 700-th iteration.
Running 800-th iteration.
Running 900-th iteration.
Running 1000-th iteration.
Running 1100-th iteration.
Running 1200-th iteration.
Running 1300-th iteration.
Running 1400-th iteration.
Running 1500-th iteration.
Running 1600-th iteration.
Running 1700-th iteration.
Running 1800-th iteration.
Running 1900-th iteration.
Running 2000-th iteration.
Running 2100-th iteration.
Running 2200-th iteration.
Running 2300-th iteration.
Running 2400-th iteration.
Running 2500-th iteration.
Running 2600-th iteration.
Running 2700-th iteration.
Running 2800-th iteration.
Running 2900-th iteration.


In [14]:
# predict on unseen data
# a small test below
y_pred = clf.predict(X_test[:10])
y_pred

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1], dtype=int8)

In [17]:
y_test[:10]

array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1], dtype=int8)
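For a quick sanity check, the ten predictions and ten true labels printed above can be tallied into a 2x2 confusion matrix by hand. This sketch uses only the standard library. Ten rows are far too few to draw conclusions from.

```python
from collections import Counter

# The ten predictions and true labels printed in the two cells above.
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
y_true = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]

# Count (true label, predicted label) pairs.
pairs = Counter(zip(y_true, y_pred))

# Rows are true classes 0 and 1, columns are predicted classes 0 and 1.
confusion = [[pairs[(t, p)] for p in (0, 1)] for t in (0, 1)]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(confusion)
print(accuracy)
```

Even in this tiny sample the classifier predicts class 1 for nine of the ten rows.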

In [15]:
# Persist model
from joblib import dump, load

dump(clf, '/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGDClassifier.joblib')

['/content/gdrive/My Drive/Coding experiment/MARTHA/data/MARTHA_SGDClassifier.joblib']

In [16]:
# Getting an accuracy measurement
clf.score(X_test, y_test)

0.5070663600637209

In [18]:
# Classification report (precision, recall and F1 per class)
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.54      0.10      0.17   1473693
           1       0.50      0.92      0.65   1470397

   micro avg       0.51      0.51      0.51   2944090
   macro avg       0.52      0.51      0.41   2944090
weighted avg       0.52      0.51      0.41   2944090
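The report can be tied back to the accuracy score with a little arithmetic. The sketch below uses the rounded recall values from the report, so the result is only an approximation of the 0.507 reported by `clf.score`. It also computes the accuracy of always predicting the majority class for comparison.

```python
# Per-class recall and support copied from the report above (recall is rounded).
recall = {0: 0.10, 1: 0.92}
support = {0: 1473693, 1: 1470397}

# Recall times support approximates the correctly classified rows per class.
correct = {c: recall[c] * support[c] for c in recall}
total = sum(support.values())
accuracy_estimate = sum(correct.values()) / total

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(support.values()) / total

print(round(accuracy_estimate, 3))
print(round(majority_baseline, 3))
```

Both numbers sit close to 0.5, which suggests this baseline is barely better than always predicting one class.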

