<a href="https://colab.research.google.com/github/pkolachi/geodist2typfeat/blob/master/exptnbs/sigtyp-st2020-part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
%autosave 60
%matplotlib inline
%pylab

fpurl   = 'https://raw.githubusercontent.com/sigtyp/ST2020/master/data/train.csv'
# the header row of the csv is not properly tab-separated, hence the hard-coding
header  = ['wals_code', 'name', 
           'latitude', 'longitude', 
           'genus', 'family', 'countrycodes', 
           'features'
          ]

CVFOLDS = 2   # default: 2 folds
N = -1        # default: use all samples 
K = 10        # default: use only 10 feature classes
REPEAT  = -1
# turn this on iff running from command-line to test performance across 
# different values for (CVFOLDS, K, REPEAT) 
BATCH = False  

import itertools as it
from collections import Counter, defaultdict
from operator    import itemgetter
from IPython.display import display as pd_displayHTML

Autosaving every 60 seconds
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


In [17]:
#%pip install -q pycodestyle_magic flake8
#%load_ext pycodestyle_magic

In [18]:
#%flake8_on -m 119 --ignore=E111

I had hoped that the provided train/test dataset would be CSV-compliant, so that loading it would be as simple as calling *pandas.read_csv*. That turned out not to be the case: the header row of the provided csv file is malformed, which makes it impossible to infer the column names with *header='infer'*. The workaround is simple: hard-code the column names and skip the first row when calling *pandas.read_csv*.
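As a minimal sketch of this workaround, the snippet below parses a small in-memory sample instead of the real file. The two data rows and the exact shape of the broken header are made up for illustration; only the column names come from this notebook.

```python
import io
import pandas as pd

header = ['wals_code', 'name', 'latitude', 'longitude',
          'genus', 'family', 'countrycodes', 'features']

# hypothetical sample: the first row mimics a header that is not
# tab-separated, the remaining rows are well-formed tab-separated records
raw = (
    "wals_code name latitude longitude genus family countrycodes features\n"
    "aaa\tLang A\t1.5\t2.5\tGenus1\tFam1\tUS\tf1=1 x|f2=2 y\n"
    "bbb\tLang B\t-3.0\t40.0\tGenus2\tFam1\tKE\tf1=2 z\n"
)

# header=None + names=... supplies the column names by hand;
# skiprows=[0] drops the malformed header row
toydf = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                    names=header, skiprows=[0])
print(toydf.shape)
```

Without `skiprows=[0]`, the malformed header would be read as a data row with a single filled column.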

In [19]:
import sys
%pip install -q --user pandas==1.0.3
import pandas as pd
df = pd.read_csv(fpurl, sep='\t', header=None, names=header,
                 #index_col=0,
                 error_bad_lines=True, skiprows=[0])
"""
# since this pynb will never be run on the held-out test set
if CVFOLDS <= 1:
  trnS, tstS = 0, 0   # dummy values for sizes of train and test partitions
else:
  tstdf = pd.read_csv(tstfpurl, sep='\t', header=None, names=header,
                      error_bad_lines=True, skiprows=[0])
  trnS, tstS = df.shape[0], tstdf.shape[0]
  df.append(tstdf)
"""
missValue = '*-missing-*'
featsFull = df.iloc[:, 0:-1].copy()  # copy, since columns are re-assigned below
clablFull = df.iloc[:, -1]
alablInst = Counter(albl for inst in clablFull for albl in inst.split('|'))
alablTabl = pd.DataFrame([{'name': n, 'id': i, 'freq': f}
                          for i,(n,f) in enumerate(alablInst.most_common(), start=1)
                         ]).set_index('name')
alablFull = pd.DataFrame([dict(albl.split('=', 1) for albl in inst.split('|'))
                          for inst in clablFull
                         ]).fillna(missValue) # fill missing values (no NaN) 
for incol in ['wals_code', 'name', 'genus', 'family', 'countrycodes']:
  featsFull[incol] = featsFull[incol].astype('category')
clablFull = clablFull.astype('category')
alablFull = alablFull.astype('category')

print(featsFull.shape, clablFull.shape, alablFull.shape, alablTabl.shape)

Note: you may need to restart the kernel to use updated packages.
(1125, 7) (1125,) (1125, 185) (973, 2)


# Adding manual features

### Ideas
+    Convert latitude and longitude to UTM which can be encoded as a discrete feature


In [20]:
# Adding manual features
# Ideas
## 1. Convert latitude and longitude to UTM which can be encoded as a discrete feature 
###### -- wiki description https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system
###### -- found source code to use: https://github.com/Turbo87/utm
###### -- kaggle question on similar topic https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/discussion/62711

Let's plot a few simple statistics about the dataset. 
1.   Histogram of the complex labels in the dataset
2.   Scatterplots of genus vs labels, family vs labels and countrycodes vs labels

In [21]:
%pip install -q --user seaborn==0.10.0
import seaborn as sns
def plot_datastats(features, clabels, alabels):
  return

if not BATCH:
  plot_datastats(featsFull, clablFull, alablFull)

Note: you may need to restart the kernel to use updated packages.


The dataset is loaded into a DataFrame and separated into two parts: input features and output labels. 

We already know which input features are categorical and which are numerical, so the columns of the feats DataFrame are encoded accordingly. *Hopefully this matters* when training different classifiers (decision trees in particular). 

At this point, I'm not looking for the best encoding scheme for the labels, which are themselves composite labels (more on this later). The training dataset provided has 1109 unique labels for 1125 languages, so almost every language has its own composite label. This suggests that *a better representation* of the label set exists than one class per composite label.

In [22]:
# sub-select data frame to speed-up experiments while debugging
import random
# feature classes are sampled without replacement. random selection gives
# robust estimates while testing, but makes statistics incomparable across
# runs -- to compare different experiments, switch to the evenly spaced
# (deterministic) selection that is commented out below.
if N < 2 or N > featsFull.shape[0]:
  subsid = list(range(featsFull.shape[0]))
else:
  subsid = list(range(0, featsFull.shape[0], featsFull.shape[0]//N))[:N]

if K < 0 or K > alablFull.shape[1]:
  subfci = list(range(alablFull.shape[1]))
else:
  subfci = list(sorted(random.sample(range(alablFull.shape[1]), K)))
  #subfci = list(range(0, alablFull.shape[1], alablFull.shape[1]//K))[:K]
subfcs = list(alablFull.columns[i] for i in subfci)

#relevfeats = [i for i,f in enumerate(header)][2:-1]
featsFull_ = featsFull.iloc[subsid, :]
clablFull_ = clablFull.iloc[subsid]
alablSub_  = alablFull.iloc[subsid, subfci]
alablFull_ = alablFull.iloc[subsid, :]

print(featsFull_.shape, clablFull_.shape, alablFull_.shape, alablSub_.shape)

(1125, 7) (1125,) (1125, 185) (1125, 10)
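
The two ways of picking feature classes discussed in the cell above can be sketched in isolation. The helper names are hypothetical, and seeding a private `random.Random` instance is an assumption of this sketch rather than what the notebook does (it calls the unseeded module-level `random.sample`).

```python
import random

def evenly_spaced(n_items, k):
    """Deterministic choice: every (n_items // k)-th index, first k of them."""
    step = n_items // k
    return list(range(0, n_items, step))[:k]

def seeded_sample(n_items, k, seed=20200408):
    """Random choice without replacement, repeatable for a given seed."""
    return sorted(random.Random(seed).sample(range(n_items), k))

print(evenly_spaced(185, 10))
print(seeded_sample(185, 10))
```

The evenly spaced variant always returns the same columns, while the seeded variant stays random but is reproducible between runs.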


Let us try a few classifiers using *scikit-learn* at this point. 

For what it is worth, the accuracies can be worse than a coin flip, considering the sparse label set.


Features of languages that are spoken in close proximity and belong to the same family should be highly informative in predicting the typological features of a new language. 

In [23]:
deprecated = """
# it is essential to make deep copies of the frame when building numpy
# matrices used for classification experiments.
# if not, changes to the matrix representations e.g. encoding categorical
# variables as ordinals or sparse-matrices are reflected in the original frame
# which results in errors when trying to re-use the frames for other experiments
# e.g. lookup in the atomic-label table built above results in errors because
# the lookup tries to find fnc=lbl-idx where idx is the category code
X   = featsFull_.copy(deep=False)
ccs = X.select_dtypes(['category']).columns
X[ccs] = X[ccs].apply(lambda x: x.cat.codes)

Y = clablFull_.copy(deep=False).cat.codes

Y_  = alablFull_.copy(deep=False)
ccs = Y_.select_dtypes(['category']).columns
Y_[ccs] = Y_[ccs].apply(lambda x: x.cat.codes)

subY_ = alablSub_.copy(deep=False)
ccs = subY_.select_dtypes(['category']).columns
subY_[ccs] = subY_[ccs].apply(lambda x: x.cat.codes)

X  = X.to_numpy()
Y  = Y.to_numpy()
Y_ = Y_.to_numpy()
subY_ = subY_.to_numpy()

print(X.shape, Y.shape, Y_.shape, subY_.shape)
""";

In [24]:
%pip install -q --user scikit-learn==0.22.2.post1
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder 

lblenc = LabelEncoder().fit(clablFull_)
Ynms   = lblenc.classes_
Y      = lblenc.transform(clablFull_)

lblenc = OrdinalEncoder().fit(alablFull_)
aYnms  = lblenc.categories_
Y_     = lblenc.transform(alablFull_)

mlablFull = [[alablTabl.loc['{0}={1}'.format(fcn, lbl),'id']
              for fcn,lbl in row.items() if lbl != missValue]
             for row in alablFull_.to_dict(orient='records')
            ]
Ymlbl = MultiLabelBinarizer().fit_transform(mlablFull)

rawX = featsFull_.copy(deep=False)

print(Y.shape, Y_.shape, Ymlbl.shape)

Note: you may need to restart the kernel to use updated packages.
(1125,) (1125, 185) (1125, 973)
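
The three target representations built above (one class per composite label, one column per feature class, and one binary indicator per atomic label) are easier to compare on a toy example. The three composite labels below are made up; the missing-value marker mirrors the one used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import (LabelEncoder, MultiLabelBinarizer,
                                   OrdinalEncoder)

# hypothetical composite labels: 'feature=value' pairs joined by '|'
composite = ["f1=a|f2=x", "f1=b", "f1=a|f2=x"]

# 1. multi-class: every distinct composite label is its own class
y_single = LabelEncoder().fit_transform(composite)

# 2. multi-output: one column per feature class, missing values filled in
atomic = pd.DataFrame([dict(pair.split('=', 1) for pair in label.split('|'))
                       for label in composite]).fillna('*-missing-*')
y_multi = OrdinalEncoder().fit_transform(atomic)

# 3. multi-label: one binary column per atomic 'feature=value' label
y_mlbl = MultiLabelBinarizer().fit_transform(
    [label.split('|') for label in composite])

print(y_single.shape, y_multi.shape, y_mlbl.shape)
```

The first form grows with the number of distinct languages, the second with the number of feature classes, and the third with the number of distinct feature values.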


In [25]:
if len(subfci) < alablFull_.shape[1]:
  clablSub = ['|'.join('{0}={1}'.format(fcn, lbl) for fcn, lbl in row.items()
                       if lbl != missValue)
              for row in alablSub_.to_dict(orient='records')
             ]
  lblenc   = LabelEncoder().fit(clablSub)
  subYnms  = lblenc.classes_
  subY     = lblenc.transform(clablSub)

  lblenc   = OrdinalEncoder().fit(alablSub_)
  subaYnms = lblenc.categories_
  subY_    = lblenc.transform(alablSub_)
  
  mlablSub = [[alablTabl.loc['{0}={1}'.format(fcn, lbl),'id']
               for fcn,lbl in row.items() if lbl != missValue]
              for row in alablSub_.to_dict(orient='records')
             ]
  subYmlbl = MultiLabelBinarizer().fit_transform(mlablSub)
else:
  subY, subY_, subYmlbl = Y, Y_, Ymlbl

print(subY.shape, subY_.shape, subYmlbl.shape)

(1125,) (1125, 10) (1125, 49)


In [26]:
# https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
# https://scikit-learn.org/stable/modules/cross_validation.html
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder

from sklearn import pipeline as skpipe
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder 

numfxr = SimpleImputer(strategy='median')
strfxr = SimpleImputer(strategy='constant', fill_value='-empty-')
strfx_ = SimpleImputer(strategy='most_frequent')

stdtrn = StandardScaler()
ohetrn = OneHotEncoder(handle_unknown='ignore', sparse=True)
ordtrn = OrdinalEncoder(categories='auto')

# for numerical features, fill unknown feature values with median value and 
# scale the column with mean and variance
numfeats = ['latitude', 'longitude']
numtrans = skpipe.Pipeline(steps=[('imputer', numfxr), ('transform',  stdtrn)])

# for categorical features, try both one-hot encoding and ordinal encoding
catfeats = ['wals_code', 'name', 'genus', 'family', 'countrycodes'] 
ohcattrans = skpipe.Pipeline(steps=[('imputer', strfxr), ('transform', ohetrn)])
ohtrans = ColumnTransformer(transformers=[('num', numtrans, numfeats), 
                                          ('cat', ohcattrans, catfeats)])

oecattrans = skpipe.Pipeline(steps=[('imputer', strfxr), ('transform', ordtrn)])
oetrans = ColumnTransformer(transformers=[('num', numtrans, numfeats), 
                                          ('cat', oecattrans, catfeats)])
# There are known issues in sklearn when integrating OrdinalEncoder into the
# pipeline: in this scikit-learn version it fails on categories it has not seen
# during fit, which breaks the standard/usual way of incorporating it as a
# transformer under cross-validation.
# Below is a 'hacky' version that gets around this by passing the full list of
# categories up front, simulating dense numerical features for this
# classification task -- note that this is still a hack
uniqcats = [rawX.loc[:,catn].unique() for catn in catfeats]
oetrans_ = skpipe.Pipeline(steps=[('imputer', strfx_),
                                  ('transform', OrdinalEncoder(categories=uniqcats))])
altoetrn = ColumnTransformer(transformers=[('num', numtrans, numfeats), 
                                           ('cat', oetrans_, catfeats)])

preprocessors = {'ohe': ohtrans,
                #'ord': oetrans,
                 'ord': altoetrn,
                }

middlelayer = {'dim0': 'passthrough',
              #'pca':  PCA(svd_solver='arpack', random_state=20200408),
               'svd':  TruncatedSVD(algorithm='arpack', random_state=20200408),
              }
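
For readers unfamiliar with `ColumnTransformer`, the following is a reduced sketch of the one-hot variant on a made-up four-row frame. The data and the single categorical column are assumptions for illustration; the step layout follows the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'latitude':  [10.0, np.nan, -5.0, 40.0],
    'longitude': [20.0, 30.0, np.nan, 5.0],
    'family':    pd.Series(['A', 'B', 'A', np.nan], dtype=object),
})

# numerical columns: median imputation, then standardization
num = Pipeline([('imputer', SimpleImputer(strategy='median')),
                ('transform', StandardScaler())])
# categorical columns: constant imputation, then one-hot encoding
cat = Pipeline([('imputer', SimpleImputer(strategy='constant',
                                          fill_value='-empty-')),
                ('transform', OneHotEncoder(handle_unknown='ignore'))])

pre = ColumnTransformer([('num', num, ['latitude', 'longitude']),
                         ('cat', cat, ['family'])])
Xt = pre.fit_transform(toy)
print(Xt.shape)  # 2 scaled columns + one column per category seen in 'family'
```

Here the imputed placeholder `'-empty-'` becomes a category of its own, so `family` expands to three indicator columns.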

In [27]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# all these classifiers support multi-class classification in sklearn
mclsclfnms = ['knn', 'lsvm', 'svc', 'dt', 'rf', 'mlp', 'adb', 
              'nb', 'ridge', 'dumbase',
             #'gp', 'qda', 
             ]
mclsclfobj = [KNeighborsClassifier(p=1), # works well for all inputs 
              LinearSVC(penalty='l1', dual=False, C=0.01, random_state=20200408),
              SVC(gamma=2, random_state=20200408),
              DecisionTreeClassifier(random_state=20200408),
              RandomForestClassifier(random_state=20200408),
              MLPClassifier(random_state=20200408),
              AdaBoostClassifier(random_state=20200408),
              GaussianNB(),
              RidgeClassifier(random_state=20200408),
              DummyClassifier(strategy="most_frequent"),
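              # the two classifiers below have no matching name in mclsclfnms
              # ('gp', 'qda' are commented out), so zip() silently drops them
              GaussianProcessClassifier(),
              QuadraticDiscriminantAnalysis(),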
             ]
mclsclfopt = dict(zip(mclsclfnms, mclsclfobj))

# setup pipelines for combination of preprocessors and classifiers
mclsnullcs = []  #[('ord', 'lsvm'), ('ohe', 'mlp'), ('ord', 'dumbase')]
pipelines  = {(clf, enc, dim):
               skpipe.make_pipeline(preprocessors[enc], middlelayer[dim],
                                    mclsclfopt[clf])
              for clf in mclsclfopt
              for enc in preprocessors
              for dim in middlelayer
              if (enc, clf) not in mclsnullcs
             }
pipelines  = dict(sorted(pipelines.items()))

from sklearn.model_selection import KFold, RepeatedKFold
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold
REPEAT  = REPEAT  if REPEAT  >  1 else 1  # sanity-check
CVFOLDS = CVFOLDS if CVFOLDS >= 2 else 2  # sanity-check
# scikit-learn documentation recommends using StratifiedKFold for classification
# problems to preserve class balance across folds. however, in this case, 
# we use KFold and RepeatedKFold because 
#  number of items in a class <= CVFOLDS (works only with 2 folds for entire dataset)
#  there is not much balance to preserve w.r.t. complex labels
cvsplits = list(RepeatedKFold(n_splits=CVFOLDS, 
                              n_repeats=REPEAT, random_state=20200408
                             ).split(rawX, Y))

statnames = ['classifier', 
             'avg.acc/tst', 'avg.acc/trn', 'std.acc/tst', 'std.acc/trn', 
             'avg.time/prd', 'avg.time/trn'] 
statcodes = ['clfn', 'mtsts', 'mtrns', 'vtsts', 'vtrns', 'predt', 'trint']

# https://machinelearningmastery.com/how-to-fix-futurewarning-messages-in-scikit-learn/
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=DeprecationWarning)
simplefilter(action='ignore', category=FutureWarning)
from sklearn.exceptions import ConvergenceWarning, FitFailedWarning
simplefilter(action='ignore', category=ConvergenceWarning)
simplefilter(action='ignore', category=FitFailedWarning)
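
A small sketch of what `RepeatedKFold` hands back may help when reading the accuracy statistics later: it yields `n_splits * n_repeats` train/test index pairs, and within each repetition the test folds partition the data. The ten-row array is a stand-in for the real feature frame.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_toy = np.arange(20).reshape(10, 2)   # hypothetical data: 10 samples
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=20200408)
splits = list(rkf.split(X_toy))

print(len(splits))                     # n_splits * n_repeats pairs
for trn, tst in splits[:2]:            # the two folds of the first repetition
    print(sorted(trn), sorted(tst))
```

Materializing the splits as a list, as done in the cell above, means every classifier is evaluated on exactly the same partitions.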

In [28]:
def plot_accuracies(accuracies, savefile=None):
  hastime = 'avg.time/prd' in accuracies.columns and 'avg.time/trn' in accuracies.columns
  fig, axs = plt.subplots(2, 2, sharex=True, sharey='row', figsize=(12, 8))
  axs[0,0].bar(accuracies['classifier'], accuracies['avg.acc/tst'],
               yerr=accuracies['std.acc/tst'])
  axs[0,0].text(.5, .9, 'held-out', horizontalalignment='center', 
                transform=axs[0,0].transAxes)
  axs[0,1].bar(accuracies['classifier'], accuracies['avg.acc/trn'],
               yerr=accuracies['std.acc/trn'])
  axs[0,1].text(.5, .9, 'training', horizontalalignment='center', 
                transform=axs[0,1].transAxes)
  if hastime:
    axs[1,0].bar(accuracies['classifier'], accuracies['avg.time/prd'])
    axs[1,0].text(.5, .9, 'prediction', horizontalalignment='center', 
                  transform=axs[1,0].transAxes)
    axs[1,1].bar(accuracies['classifier'], accuracies['avg.time/trn'])
    axs[1,1].text(.5, .9, 'training', horizontalalignment='center', 
                  transform=axs[1,1].transAxes)

  axs[0,0].set_ylim(ymin=0)
  axs[1,0].set_ylim(ymin=0)
  axs[0,0].set_ylabel('Accuracy')
  axs[1,0].set_ylabel('Time(s)')
  # rotate all xtick labels
  _ = [l.set_rotation(90) for a in axs.flatten() for l in a.get_xticklabels()]
  # minimize space between subplots
  plt.subplots_adjust(wspace=0, hspace=0)

  if not savefile:
    plt.show()
  else:
    plt.savefig(savefile)
  return

In [29]:
from sklearn import model_selection as skms

def codify_classifier(clf):
  return clf if isinstance(clf, str) else '/'.join(clf)

def trainFullClassifiersCV(classifiers, X, Y):
  clfaccs = []
  for iclf, nclf in enumerate(classifiers):
    clfsce = skms.cross_validate(classifiers[nclf], X, Y, cv=cvsplits, 
                                 return_train_score=True
                                )
    clfnfo = [codify_classifier(nclf), 
              100*clfsce['test_score'].mean(), 100*clfsce['train_score'].mean(),
              100*clfsce['test_score'].std(),  100*clfsce['train_score'].std(),
              clfsce['score_time'].sum(), clfsce['fit_time'].sum()  # totals over all folds
             ]
    clfaccs.append(dict(zip(statnames, clfnfo)))
  return pd.DataFrame(clfaccs)

clfaccs = trainFullClassifiersCV(pipelines, rawX, Y)
if not BATCH:
  clfaccs_ = clfaccs.dropna()
  plot_accuracies(clfaccs_)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/Users/admin/Library/Python/3.7/lib/python/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-29-1e51870963d9>", line 20, in <module>
    clfaccs = trainFullClassifiersCV(pipelines, rawX, Y)
  File "<ipython-input-29-1e51870963d9>", line 10, in trainFullClassifiersCV
    return_train_score=True
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 236, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/Users/admin/Library/Python/3.7/lib/python/site-packages/joblib/parallel.py", line 924, in __call__
    while self.dispatch_one_batch(iterator):
  File "/Users/admin/Library/Python/3.7/lib/python/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/admin/Library/Python/3.7/

KeyboardInterrupt: 

In [None]:
%time sclfaccs = trainFullClassifiersCV(pipelines, rawX, subY)
if not BATCH:
  sclfaccs_ = sclfaccs.dropna()
  pd_displayHTML(sclfaccs_.style.hide_index())
  plot_accuracies(sclfaccs_)

In [None]:
from sklearn.model_selection import GridSearchCV

paramsKNN = {
             'kneighborsclassifier__n_neighbors':range(3, 11), 
             'kneighborsclassifier__weights':('uniform', 'distance'),
            #'kneighborsclassifier__algorithm':('auto', 'ball_tree', 'kd_tree', 'brute'),
             'kneighborsclassifier__p':(1, 2), 
            }
paramsSVL = {
            #'linearsvc__penalty':('l1', 'l2'),
            #'linearsvc__loss':('squared_hinge', 'hinge'), 
            #'linearsvc__dual':(False, True), 
            #'linearsvc__tol',     
             'linearsvc__C':np.arange(0.01, 0.25, 0.01),  # C must be strictly positive
            #'linearsvc__fit_intercept', 
            #'linearsvc__intercept_scaling', 
            #'linearsvc__class_weight':(None, 'balanced'), 
            #'linearsvc__max_iter', 
            }
paramsSVC = {
             'svc__C':10**-np.arange(-1, 2.5, 0.35), 
             'svc__kernel':('poly', 'rbf', 'sigmoid',), 
             'svc__degree':range(2, 5), 
            #'svc__gamma':np.arange(0, 4, 0.5), 
            #'svc__coef0', 
            #'svc__shrinking', 
            #'svc__tol', 
             'svc__class_weight':(None, 'balanced'), 
            }
paramsMLP = {
            #'mlpclassifier__hidden_layer_sizes':[(size,) for size in range(50, 60, 20)], 
            #'mlpclassifier__activation':('relu', 'logistic', 'tanh'), 
            #'mlpclassifier__solver':('adam', 'lbfgs'), 
             'mlpclassifier__alpha':10.0**np.arange(-7, 4), 
            #'mlpclassifier__max_iter':range(200, 1001, 100),
            #'mlpclassifier__early_stopping':(True, False),
            }  
paramsDT  = {
             'decisiontreeclassifier__criterion':('gini', 'entropy'), 
             'decisiontreeclassifier__max_features':('auto', 'sqrt', 'log2', None), 
             'decisiontreeclassifier__class_weight':(None, 'balanced'),
             'decisiontreeclassifier__ccp_alpha':[i/10 for i in range(0, 11)], 
            #'decisiontreeclassifier__max_depth':, 
            #'decisiontreeclassifier__max_leaf_nodes':, 
            #'decisiontreeclassifier__min_impurity_decrease',
            #'decisiontreeclassifier__min_impurity_split', 
            #'decisiontreeclassifier__min_samples_leaf', 
            #'decisiontreeclassifier__min_samples_split', 
            #'decisiontreeclassifier__min_weight_fraction_leaf', 
            #'decisiontreeclassifier__presort', 
            #'decisiontreeclassifier__random_state', 
            #'decisiontreeclassifier__splitter'
            }
paramsRF  = {
             'randomforestclassifier__n_estimators':range(10, 201, 10),
            #'randomforestclassifier__criterion':('gini', 'entropy'),
            #'randomforestclassifier__max_features':('auto', 'sqrt', 'log2', None),
            #'randomforestclassifier__bootstrap':(True, False),
            #'randomforestclassifier__class_weight':(None, 'balanced'), 
            #'randomforestclassifier__ccp_alpha':[i/10 for i in range(0, 11)],
            #'randomforestclassifier__max_depth', 
            #'randomforestclassifier__min_samples_split', 
            #'randomforestclassifier__min_samples_leaf', 
            #'randomforestclassifier__min_weight_fraction_leaf',
            #'randomforestclassifier__max_leaf_nodes', 
            #'randomforestclassifier__min_impurity_decrease', 
            #'randomforestclassifier__min_impurity_split', 
            #'randomforestclassifier__max_samples', 
            }
paramsDummy = {'dummyclassifier__strategy':('stratified', 'most_frequent', 'prior', 'uniform'),
              #'dummyclassifier__constant':(1), 
              #'dummyclassifier__random_state', 
              }

#print(pipelines[('lsvm', 'ord', 'svd')].get_params().keys())

cmnchoices = None  # best-parameter settings shared by all pipelines seen so far
for clf in (_ for _ in pipelines if _[0] == 'svc'):
  if clf[1:] == ('ohe', 'pca'):
    continue
  grdmclsclf = GridSearchCV(pipelines[clf], param_grid=paramsSVC, 
                            cv=cvsplits).fit(rawX, subY)
  bstprmsstr = '|'.join("{0}={1}".format(k,v) for k,v in grdmclsclf.best_params_.items())
  print(codify_classifier(clf), grdmclsclf.best_score_, bstprmsstr)
  bstprms = set(grdmclsclf.best_params_.items())
  # an empty intersection must stay empty, so None (and not emptiness) marks
  # the state where no pipeline has been searched yet
  cmnchoices = bstprms if cmnchoices is None else cmnchoices & bstprms
  grdmclsres = pd.DataFrame(grdmclsclf.cv_results_)
print(cmnchoices)
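
The parameter names in the grids above follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. A minimal sketch on synthetic data (the dataset and the two-step pipeline are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for the real features and labels
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(list(pipe.named_steps))   # step names usable as prefixes in the grid

grid = GridSearchCV(pipe,
                    param_grid={'kneighborsclassifier__n_neighbors': [3, 5],
                                'kneighborsclassifier__p': [1, 2]},
                    cv=2)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```

Calling `pipe.get_params().keys()` lists every name that can be used in a grid for a given pipeline.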

In [None]:
from sklearn import model_selection as skms

def trainIndClassifiersCV(classifiers, X, matY, return_clfinsts=False):
  lclfinst  = {}  # table to store classifiers for later use
  lclfaccs = np.zeros((matY.shape[-1], len(classifiers), 6))
  avgcaccs = []
  for iclf, nclf in enumerate(classifiers):
    for indY in range(matY.shape[-1]):
      clfsce = skms.cross_validate(classifiers[nclf], X, matY[:,indY], cv=cvsplits,
                                   return_train_score=True, 
                                   return_estimator=return_clfinsts
                                  )
      lclfaccs[indY][iclf] = [100*clfsce['test_score'].mean(), 
                              100*clfsce['train_score'].mean(), 
                              100*clfsce['test_score'].std(),  
                              100*clfsce['train_score'].std(), 
                              clfsce['score_time'].sum(), 
                              clfsce['fit_time'].sum()
                             ]
      if return_clfinsts:
        lclfinst[(nclf, indY)] = clfsce['estimator']
    clfnfo = [codify_classifier(nclf), 
              lclfaccs[:,iclf,0].mean(), lclfaccs[:,iclf,1].mean(),
              lclfaccs[:,iclf,2].mean(), lclfaccs[:,iclf,3].mean(),
              lclfaccs[:,iclf,4].sum(),  lclfaccs[:,iclf,5].sum()
             ]
    avgcaccs.append(dict(zip(statnames, clfnfo)))
  if return_clfinsts:
    return (pd.DataFrame(avgcaccs), lclfinst)
  else:
    return pd.DataFrame(avgcaccs)

%time avgcaccs, lclclfs = trainIndClassifiersCV(pipelines, rawX, subY_, return_clfinsts=True)
if not BATCH:
  avgcaccs_ = avgcaccs.dropna()
  #print(avgcaccs_.round(3).to_markdown(showindex=False))
  pd_displayHTML(avgcaccs_.style.hide_index())
  plot_accuracies(avgcaccs_)  

In [None]:
from sklearn import model_selection as skms
from sklearn import metrics as skmt
from sklearn.exceptions import NotFittedError

def skmt_mlmc_accuracy_score(y_true, y_pred):
  "Classification accuracy for multi-label multi-class problems"
  n_samples = y_true.shape[0]
  return sum(1.0 if np.array_equal(y_true[i], y_pred[i]) else 0
             for i in range(n_samples)
            ) / n_samples

def jntTestIndClassifiersCV(classifiers, X, matY, clfinstances=None):
  if not clfinstances:
    _, clfinstances = trainIndClassifiersCV(classifiers, X, matY, return_clfinsts=True)
  trnpids = list(map(itemgetter(0), cvsplits))
  tstpids = list(map(itemgetter(1), cvsplits))
  # when passing the numerical matrix directly to the classifier
  _predsst = lambda clf,sids: clf.predict(X[sids,:]).reshape((-1, 1))
  # we are now handling the DataFrame directly using pipelines
  predsst = lambda clf, sids: clf.predict(X.iloc[sids,:]).reshape((-1, 1))
  jclfaccs = []
  for iclf, nclf in enumerate(classifiers):
    try:
      tstpreds, trnpreds = [], []
      for cvid, (trnids,tstids) in enumerate(cvsplits):
        indpreds = list(it.starmap(predsst, [(clfinstances[nclf, indY][cvid], tstids)
                                             for indY in range(matY.shape[-1])]))
        tstpreds.append(np.hstack(indpreds))
        indpreds = list(it.starmap(predsst, [(clfinstances[nclf, indY][cvid], trnids)
                                             for indY in range(matY.shape[-1])]))  
        trnpreds.append(np.hstack(indpreds))
        
      tstaccs = 100*np.array([skmt_mlmc_accuracy_score(matY[sids], preds)
                              for sids, preds in zip(tstpids, tstpreds)])
      trnaccs = 100*np.array([skmt_mlmc_accuracy_score(matY[sids], preds)
                              for sids, preds in zip(trnpids, trnpreds)])
      clfnfo = [codify_classifier(nclf), 
                tstaccs.mean(), trnaccs.mean(), tstaccs.std(), trnaccs.std()]
      jclfaccs.append(dict(zip(statnames, clfnfo)))
    except NotFittedError as err:
      continue
  return pd.DataFrame(jclfaccs)

%time jclfaccs = jntTestIndClassifiersCV(pipelines, rawX, subY_, clfinstances=lclclfs)
if not BATCH:
  jclfaccs_ = jclfaccs.dropna()
  pd_displayHTML(jclfaccs_.style.hide_index())
  plot_accuracies(jclfaccs_)
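
The joint evaluation above counts a sample as correct only if every per-feature prediction is correct, which is much stricter than averaging the per-feature accuracies. A small sketch with made-up predictions shows the gap; the helper names are hypothetical and the vectorized form is an alternative to the loop in `skmt_mlmc_accuracy_score`.

```python
import numpy as np

def exact_match_accuracy(y_true, y_pred):
    """Fraction of rows where all outputs are predicted correctly."""
    return float(np.all(y_true == y_pred, axis=1).mean())

def per_label_accuracy(y_true, y_pred):
    """Fraction of individual outputs predicted correctly."""
    return float((y_true == y_pred).mean())

# hypothetical gold labels and predictions: 3 samples, 2 feature classes
y_true = np.array([[0, 1], [1, 1], [2, 0]])
y_pred = np.array([[0, 1], [1, 0], [2, 0]])

print(exact_match_accuracy(y_true, y_pred))   # 2 of 3 rows fully correct
print(per_label_accuracy(y_true, y_pred))     # 5 of 6 cells correct
```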


TODO: 

*   hyper-parameter search and pick 3 classifiers
*   also pick an optimal encoding scheme
*   test some manual feature additions like geographic distances
*   also check l1 regularization and see which features matter the most
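*   DT/RF & MLP results can't be replicated when run multiple times. The cause is not yet clear: possibly non-convex optimization, or some other source of random initialization (for instance, the feature classes are drawn with an unseeded `random.sample`).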

In [None]:
mlblclfnms = ['1knn', '2knn', 'mlp', 'dt', 'rf', 'dumbase']
mlblclfobj = [KNeighborsClassifier(p=1), # Manhattan distance
              KNeighborsClassifier(p=2), # Euclidean distance
              MLPClassifier(activation='logistic', solver='lbfgs', alpha=10, 
                           early_stopping=True, random_state=20200408),
# optimized to reduce overfitting; good accuracy on small feature subsets; 
# terrible accuracy on full dataset; doesn't reduce train/test time
# full hyper-parameter tuning needs to be carried out
              DecisionTreeClassifier(ccp_alpha=0.1, random_state=20200408), 
              RandomForestClassifier(n_estimators=10, ccp_alpha=0.1, random_state=20200408),
              DummyClassifier(strategy='most_frequent')
             ]
mlblclfopt = dict(zip(mlblclfnms, mlblclfobj))
mlblnullcs = []
#mlblnullcs = [('ohe', '1knn'),   # Manhattan works better for OrdinalEncoder 
#              ('ord', '2knn'),   # Euclidean works better for OneHotEncoder
#              ('ord', 'dumbase') # DummyClassifier doesn't care for inp. reprs
#             ]
mlblclfpps = {(clf, enc, dim):
               skpipe.make_pipeline(preprocessors[enc], middlelayer[dim],
                                    mlblclfopt[clf])
              for clf in mlblclfopt
              for enc in preprocessors
              for dim in middlelayer
              if (enc, clf) not in mlblnullcs
             }
mlblclfpps = dict(sorted(mlblclfpps.items()))

%time smlblclfacc = trainFullClassifiersCV(mlblclfpps, rawX, subYmlbl)
if not BATCH:
  smlblclfacc_ = smlblclfacc.dropna()
  pd_displayHTML(smlblclfacc_.style.hide_index())
  plot_accuracies(smlblclfacc_)

In [None]:
from sklearn.model_selection import GridSearchCV

paramsknn = {'kneighborsclassifier__n_neighbors':range(3, 11), 
             'kneighborsclassifier__weights':('uniform', 'distance'),
            #'kneighborsclassifier__algorithm':('auto', 'ball_tree', 'kd_tree', 'brute'),
             'kneighborsclassifier__p':(1, 2), 
            }
paramsmlp = {#'mlpclassifier__hidden_layer_sizes':[(size,) for size in range(50, 60, 20)], 
            #'mlpclassifier__activation':('relu', 'logistic', 'tanh'), 
            #'mlpclassifier__solver':('adam', 'lbfgs'), 
             'mlpclassifier__alpha':10.0**np.arange(-7, 4), 
            #'mlpclassifier__max_iter':range(200, 1001, 100),
            #'mlpclassifier__early_stopping':(True, False),
            }  
paramsdt  = {'decisiontreeclassifier__criterion':('gini', 'entropy'), 
             'decisiontreeclassifier__max_features':('auto', 'sqrt', 'log2', None), 
             'decisiontreeclassifier__class_weight':(None, 'balanced'),
             'decisiontreeclassifier__ccp_alpha':[i/10 for i in range(0, 11)], 
            #'decisiontreeclassifier__max_depth':, 
            #'decisiontreeclassifier__max_leaf_nodes':, 
            #'decisiontreeclassifier__min_impurity_decrease',
            #'decisiontreeclassifier__min_impurity_split', 
            #'decisiontreeclassifier__min_samples_leaf', 
            #'decisiontreeclassifier__min_samples_split', 
            #'decisiontreeclassifier__min_weight_fraction_leaf', 
            #'decisiontreeclassifier__presort', 
            #'decisiontreeclassifier__random_state', 
            #'decisiontreeclassifier__splitter'
            }
paramsrf  = {'randomforestclassifier__n_estimators':range(10, 201, 10),
            #'randomforestclassifier__criterion':('gini', 'entropy'),
            #'randomforestclassifier__max_features':('auto', 'sqrt', 'log2', None),
            #'randomforestclassifier__bootstrap':(True, False),
            #'randomforestclassifier__class_weight':(None, 'balanced'), 
            #'randomforestclassifier__ccp_alpha':[i/10 for i in range(0, 11)],
            #'randomforestclassifier__max_depth', 
            #'randomforestclassifier__min_samples_split', 
            #'randomforestclassifier__min_samples_leaf', 
            #'randomforestclassifier__min_weight_fraction_leaf',
            #'randomforestclassifier__max_leaf_nodes', 
            #'randomforestclassifier__min_impurity_decrease', 
            #'randomforestclassifier__min_impurity_split', 
            #'randomforestclassifier__max_samples', 
            }
paramsdummy = {'dummyclassifier__strategy':('stratified', 'most_frequent', 'prior', 'uniform'),
              #'dummyclassifier__constant':(1), 
              #'dummyclassifier__random_state', 
              }

#print(mlblclfpps[('dumbase', 'ohe', 'dim0')].get_params().keys())
grdmlblclf = GridSearchCV(mlblclfpps[('dumbase', 'ohe', 'dim0')], param_grid=paramsdummy,
                         cv=cvsplits, n_jobs=2).fit(rawX, subYmlbl)
print(grdmlblclf.best_params_)
grdmlblres = pd.DataFrame(grdmlblclf.cv_results_)
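
Each key in the grids above addresses a pipeline step by name, followed by a double underscore and the parameter name. `GridSearchCV` then tries every combination of the listed values, so the number of candidates is the product of the value counts, and each candidate is fitted once per cross-validation split. A minimal sketch of that enumeration, using a small made-up grid and a hypothetical helper `expand_grid` (neither is part of the notebook):

```python
import itertools as it

# Hypothetical toy grid; keys follow the '<step name>__<parameter>' convention.
grid = {
    'decisiontreeclassifier__criterion': ('gini', 'entropy'),
    'decisiontreeclassifier__ccp_alpha': [0.0, 0.1, 0.2],
}

def expand_grid(grid):
    """Hypothetical helper: list every parameter combination in a grid."""
    names = sorted(grid)
    return [dict(zip(names, combo))
            for combo in it.product(*(grid[name] for name in names))]

candidates = expand_grid(grid)
n_splits = 2  # assumed number of cross-validation splits
print(len(candidates), 'candidates,', len(candidates) * n_splits, 'fits')
```

Counting candidates this way before launching a search gives a rough idea of its cost; the final refit on the full data is not included in the count.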

In [None]:
%time mlblclfacc = trainFullClassifiersCV(mlblclfpps, rawX, Ymlbl)
if not BATCH:
  mlblclfacc_ = mlblclfacc.dropna()
  pd_displayHTML(mlblclfacc_.style.hide_index())
  plot_accuracies(mlblclfacc_)
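
The multi-label target `Ymlbl` used here is an indicator matrix: one row per language and one column per `feature=value` label, with a 1 wherever that value is observed. A minimal sketch of the encoding on toy data, assuming a plain dict `label_ids` as a stand-in for the label table and made-up feature names:

```python
import numpy as np

# Hypothetical stand-in for the label table: 'feature=value' -> column id.
label_ids = {'featA=x': 0, 'featA=y': 1, 'featB=z': 2}

# Toy per-language annotations; None marks a missing feature value.
rows = [
    {'featA': 'x', 'featB': 'z'},
    {'featA': 'y', 'featB': None},
]

# Collect one (row, column) pair for every observed feature value.
pairs = np.array([(irow, label_ids['{0}={1}'.format(fcn, lbl)])
                  for irow, row in enumerate(rows)
                  for fcn, lbl in row.items() if lbl is not None])

Ymlbl = np.zeros((len(rows), len(label_ids)))
# Two index arrays separated by a comma address individual (row, column) cells.
Ymlbl[pairs[:, 0], pairs[:, 1]] = 1
print(Ymlbl)
```

Missing values simply leave their columns at 0, so languages with sparse annotations still get a full-width row.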

In [None]:
expreport = []
try:
  if not BATCH:
    raise UserWarning("The code in this cell runs only with BATCH=True, from the command line")
  CVFOLDS, REPEAT = 2, 5 
  samplec, featcsc = alablFull.shape[0], alablFull.shape[1]
  subsid = list(range(samplec))  
  #params = [(k, 10) for k in range(5, fnccount, 10)]
  params = [(CVFOLDS, REPEAT, k) for k in range(5, fnccount, 10)]
  ATTEMPTS = 10
  #params = list(params)[:1]

  # this is to make sure that this block can be run in standalone mode
  featsFull_ = featsFull.iloc[subsid,:]
  clablFull_ = clablFull.iloc[subsid]
  alablFull_ = alablFull.iloc[subsid,:]

  X   = featsFull_.copy(deep=False)
  ccs = X.select_dtypes(['category']).columns 
  X[ccs] = X[ccs].apply(lambda x: x.cat.codes)
  X  = X.to_numpy()
  
  Y = clablFull_.copy(deep=False).cat.codes
  Y  = Y.to_numpy()

  Y_  = alablFull_.copy(deep=False)
  ccs = Y_.select_dtypes(['category']).columns
  Y_[ccs] = Y_[ccs].apply(lambda x: x.cat.codes)
  Y_ = Y_.to_numpy()
  
  Ymlbl = np.zeros((X.shape[0], alablTabl.shape[0]))
  filidx = np.array([ (irow, alablTabl.loc['{0}={1}'.format(fcn, lbl), 'id'])
                    for irow, row in enumerate(alablFull_.to_dict(orient='records'))
                    for fcn, lbl in row.items() if not pd.isna(lbl)
                   ])
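  # index with two arrays (row ids, column ids); recent NumPy versions
  # treat a list of arrays as a single array index instead
  Ymlbl[filidx[:,0], filidx[:,1]] = 1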

  for expparam in params:
    cvsplits = list(RepeatedKFold(n_splits=expparam[0], 
                                  n_repeats=expparam[1], random_state=20200408
                                 ).split(X, Y))
    expr1 = trainFullClassifiersCV(classifiers, X, Y)
    # run experiment using X,Y
    expreport.extend(dict([('ExpName', 'fulllbl-dense'), 
                           ('CVF', expparam[0]),
                           ('REPEAT', expparam[1]), 
                           ('K', expparam[2]),
                           ('Params', 'default')
                          ] + \
                          list(row.items()))
                     for row in expr1.to_dict(orient='records')
                    ) 
    expr2 = trainFullClassifiersCV(mlblclasfrs, X, Ymlbl)
    expreport.extend(dict([('ExpName', 'fulllbl-sparse'), 
                           ('CVF', expparam[0]),
                           ('REPEAT', expparam[1]), 
                           ('K', expparam[2]),
                           ('Params', 'default')
                          ] + \
                          list(row.items()))
                     for row in expr2.to_dict(orient='records')
                    ) 

    choices = ncombr(fnccount, expparam[2])
    for trial in range(1, ATTEMPTS+1):  #int(FRACP*choices)
      subfci = list(sorted(random.sample(range(fnccount), expparam[2])))
      subfcs = list(alablFull.columns[i] for i in subfci)
      
      alablSub_  = alablFull.iloc[subsid,subfci]

      clablSub = ['|'.join('{0}={1}'.format(k,v)
                           for k,v in row.items() if not pd.isna(v))
                  for row in alablSub_.to_dict(orient='records')
                 ]
      clablSub = pd.Series(clablSub, name=header[-1])
      subY = clablSub.astype('category').cat.codes
      subY = subY.to_numpy()

      subY_ = alablSub_.copy(deep=False)
      ccs = subY_.select_dtypes(['category']).columns
      subY_[ccs] = subY_[ccs].apply(lambda x: x.cat.codes)
      subY_ = subY_.to_numpy()

      subYmlbl = np.zeros((Y.shape[0], alablTabl.shape[0]))
      filidx = np.array([ (irow, alablTabl.loc['{0}={1}'.format(fcn, lbl), 'id'])
                         for irow, row in enumerate(alablSub_.to_dict(orient='records'))
                         for fcn, lbl in row.items() if not pd.isna(lbl)
                       ])
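      # index with two arrays (row ids, column ids); recent NumPy versions
      # treat a list of arrays as a single array index instead
      subYmlbl[filidx[:,0], filidx[:,1]] = 1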
      
      expr1 = trainFullClassifiersCV(classifiers, X, subY)
      expreport.extend(dict([('ExpName', 'sublbl-dense'), ('CVF', expparam[0]),
                             ('REPEAT', expparam[1]), ('K', expparam[2]),
                             ('TRIAL', 'T{0}'.format(trial)), 
                             ('Params', 'default')
                            ] + \
                            list(row.items()))
                       for row in expr1.to_dict(orient='records')) 
      
      expr2 = trainFullClassifiersCV(mlblclasfrs, X, subYmlbl)
      expreport.extend(dict([('ExpName', 'sublbl-sparse'), ('CVF', expparam[0]),
                             ('REPEAT', expparam[1]), ('K', expparam[2]),
                             ('TRIAL', 'T{0}'.format(trial)), 
                             ('Params', 'default')
                            ] + \
                            list(row.items()))
                       for row in expr2.to_dict(orient='records')) 
      
      expr3, clfs = trainIndClassifiersCV(classifiers, X, subY_, return_clfinsts=True)
      expreport.extend(dict([('ExpName', 'sublbl-dense-ind'), ('CVF', expparam[0]),
                             ('REPEAT', expparam[1]), ('K', expparam[2]),
                             ('TRIAL', 'T{0}'.format(trial)),
                             ('Params', 'default')
                            ] + \
                            list(row.items()))
                       for row in expr3.to_dict(orient='records')) 
      
      expr4 = jntTestIndClassifiersCV(classifiers, X, subY_, clfs)
      expreport.extend(dict([('ExpName', 'sublbl-dense-jnt'), ('CVF', expparam[0]),
                             ('REPEAT', expparam[1]), ('K', expparam[2]),
                             ('TRIAL', 'T{0}'.format(trial)),
                             ('Params', 'default')
                            ] + \
                            list(row.items()))
                       for row in expr4.to_dict(orient='records')) 
  
  pd.DataFrame(expreport).to_html('sigtyp-st2020-part1-batchexps-results.html', index=False)
  pd.DataFrame(expreport).to_json('sigtyp-st2020-part1-batchexps-results.json')
except KeyboardInterrupt:
  pd.DataFrame(expreport).to_html('sigtyp-st2020-part1-batchexps-results-aborted.html', index=False)
  pd.DataFrame(expreport).to_json('sigtyp-st2020-part1-batchexps-results-aborted.json')
except UserWarning as err:
  print(err)
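
Every experiment in the batch cell above records its results the same way: each result row is prefixed with the experiment's metadata and appended to `expreport`. That pattern could be factored into a small helper; a sketch, assuming a hypothetical function `tag_rows` and made-up result columns (`Classifier`, `Accuracy`) that need not match the notebook's actual output:

```python
def tag_rows(meta, rows):
    """Hypothetical helper: prefix each result row with experiment metadata."""
    # Metadata keys come first; a result column with the same name would override them.
    return [{**meta, **row} for row in rows]

expreport = []
# Made-up result rows standing in for expr1.to_dict(orient='records').
toy_rows = [{'Classifier': 'dumbase', 'Accuracy': 0.25},
            {'Classifier': 'knn', 'Accuracy': 0.5}]
meta = {'ExpName': 'sublbl-dense', 'CVF': 2, 'REPEAT': 5, 'K': 15,
        'TRIAL': 'T1', 'Params': 'default'}
expreport.extend(tag_rows(meta, toy_rows))
print(expreport[0])
```

With such a helper, each of the six reporting blocks would reduce to one `expreport.extend(...)` call that differs only in its metadata.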