# Introduction

In this notebook I try to replicate the main results of the following paper:
> Varun Teja Gundabathula and V. Vaidhehi, An Efficient Modelling of Terrorist Groups in India using Machine Learning Algorithms, Indian Journal of Science and Technology, Vol 11(15), DOI: 10.17485/ijst/2018/v11i15/121766, April 2018

There, the authors used to Global Terrorism Database to predict perpetrator groups based on different attributes of the past incidences.

# Libraries

## Fundamentals

In [1]:
clear

[H[2J

In [2]:
import pandas as pd
import numpy as np

## Preprocessing

In [3]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from scipy.sparse import csr_matrix

## Cross validation

In [4]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_validate, StratifiedKFold, StratifiedShuffleSplit

## Models

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
#from sklearn.linear_model import LogisticRegression

## Metrics

In [6]:
from sklearn.metrics import accuracy_score, precision_score, classification_report, roc_auc_score

## Memory management
### Setting the temp folder
This is required for "jobs=-1" to work on Kaggle at some cases (see https://www.kaggle.com/getting-started/45288#292143)

In [7]:
%env JOBLIB_TEMP_FOLDER=/tmp

env: JOBLIB_TEMP_FOLDER=/tmp


## Garbage collector

In [8]:
import gc

# Loading the data

Previously we loaded the data and created a sample out of it. From now on we are going to use it for our analysis and modeling.

In [9]:
# Instead of the excel from their homepage, I use the csv version they uploaded to Kaggle
#gtd = pd.read_excel("globalterrorismdb_0617dist.xlsx")
gtd = pd.read_csv("./input/globalterrorismdb_0617dist.csv", encoding='ISO-8859-1')

  interactivity=interactivity, compiler=compiler, result=result)


In [10]:
# In case we want to use a sample
gtd_ori = gtd
gtd = gtd.sample(frac=0.1)

# Data transformations

The authors of the paper trained and tested the model on a subsample of the whole dataset, namely:
* Between 1970-2015
* Incidents related to India
* Cases to which the dataset attributes a perpetrator group

They also preselected a number of features for the model.

# II. Using a pipeline

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin , BaseEstimator

The authors of the paper trained and tested the model on a subsample of the whole dataset, namely:
* Between 1970-2015
* Incidents related to India
* Cases to which the dataset attributes a perpetrator group

They also preselected a number of features for the model.

# Building the pipeline

## Description of steps

1. Selecting samples
 - from India
 - between 1970 and 2015
 - those of which have known perpetartors
 - those of which belong to perpetrators with at least two connected incidents

Affects the whole training dataset (both X and Y).
 
2. Selecting columns
 - 'iyear'
 - 'imonth'
 - 'iday'
 - 'extended'
 - 'provstate'
 - 'city'
 - 'attacktype1_txt'
 - 'targtype1_txt'
 - 'nperps'
 - 'weaptype1_txt'
 - 'nkill'
 - 'nwound'
 - 'nhostkid'

Affects only the training attributes.

3. Creating a new attribute from the following three:
 - 'nkill'
 - 'nwound'
 - 'nhostkid'

Affects only the training attributes.

4. Converting the categorical attributes into binary form
5. Converting the target labels into binary form



6. Converting both the attributes and labels to sparse matrices
7. Rebalancing the dataset based on the target classes
8. Running the model

In [13]:
steps_in_val_dict = {
    'Set location': [1, 1, 'Problem dependent', 1],
    'Set period': [1, 1, 'Affects results', 'Affects results'],
    'Only known samples': [1, 1, 'Affects results', 'Affects results'],
    'Only groups with more than 1 incident': [1, 1, 'Affects results', 'Affects results'],
    'Feature seleciton': [1, 0, 1, 0],
    'New feature creation': [1, 0, 0, 0],
    'Binarizing X': [1, 0, 1, 0],
    'Binarize y': [0, 1, 0, 1],
    'Transfrom X to sparse': [1, 0, 1, 0],
    'Transform y to sparse': [1, 0, 1, 0],
    'SMOTE': [1, 1, 'Affects model', 'Affects model']
}

steps_in_val = pd.DataFrame(steps_in_val_dict, index=('train_x', 'train_y', 'val_x', 'val_y')).T
steps_in_val

Unnamed: 0,train_x,train_y,val_x,val_y
Set location,1,1,Problem dependent,1
Set period,1,1,Affects results,Affects results
Only known samples,1,1,Affects results,Affects results
Only groups with more than 1 incident,1,1,Affects results,Affects results
Feature seleciton,1,0,1,0
New feature creation,1,0,0,0
Binarizing X,1,0,1,0
Binarize y,0,1,0,1
Transfrom X to sparse,1,0,1,0
Transform y to sparse,1,0,1,0


## Custom transformers

In [136]:
def showXy(X, y=None):
        if isinstance(X, tuple):
            print("X[0].shape: {}".format(X[0].shape))
            print("X[1].shape: {}".format(X[1].shape))
            
        else:
            print("X.shape: {}".format(X.shape))
            
        if isinstance(y, (np.ndarray, pd.Series, tuple, pd.DataFrame, int)):
            print("y.shape: {}".format(y.shape))
        
        #print("self.y.shape: {}".format(self.y.shape))

In [137]:
class CodeMergeNaff(TransformerMixin, BaseEstimator):
    def __init__(self, col_bins_codes):
        self.cbc = col_bins_codes
        
        #print("\nCodeMergeNaff __init__")
        #showXy(X, y)
        
        
    def fit(self, X, y):
        #print("\nCodeMergeNaff fit START")
        #showXy(X, y)
        
        return self

    def naff(self, X, crits):
        #print("\nCodeMergeNaff naff START")
        #showXy(X, y)        

        n = pd.Series('_')
        
        for key, i in zip(crits.keys(), range(len(crits))):
            i = X.loc[:, key].copy()

            i[X.loc[:,key] == 0] = 'n'
            i[(X.loc[:,key] > 0) 
              & (X.loc[:,key] < crits[key][0])] = crits[key][2][2]
            i[(X.loc[:,key] <= crits[key][1]) 
              & (X.loc[:,key] >= crits[key][0])] = crits[key][2][1]
            i[X.loc[:,key] > crits[key][1]] = crits[key][2][0]
            
            n = pd.concat((n, i), axis=1)
            # print(n.columns)
            
        n.drop(columns=0, inplace=True)

        #print("\nCodeMergeNaff naff END")
        #showXy(X, y)

        return n
    
    def transform(self, X):
        print("\nCodeMergeNaff transform START")
        #showXy(X, y)
        
        naffect = self.naff(X, self.cbc)
        
        naffect.nhostkid[naffect.nhostkid == -99] = np.NaN
        naffect.replace(np.NaN, 'n', inplace=True)
        naffect = naffect.iloc[:,0] +  naffect.iloc[:,1] +  naffect.iloc[:,2]
        
        X.drop(columns=['nkill', 'nwound', 'nhostkid'], inplace=True)
        X['naffect'] = naffect

        X.nperps.where(X.nperps != -99, 0, inplace=True)
        X.nperps.fillna(0, inplace=True)
        
        print("CodeMergeNaff transform END")
        return X

In [138]:
class ProcPerf(TransformerMixin, BaseEstimator):
    def __init__(self):
        #print("\nProcPerf __init__")
        #showXy(X, y)
        pass
    
    def fit(self, X, y):
        #print("\nProcPerf fit START")
        #showXy(X, y)
               
        return self  

    def dtypeconv(self, X):
        #print("\nProcPerf dtypeconv START")
        #showXy(X) # y
        
        X.loc[:,['imonth', 'iday', 'extended']] = X.loc[:,['imonth', 'iday', 'extended']]\
                                                   .astype('int8', copy=False) 
        
        X.loc[:,['iyear', 'nperps']] = X.loc[:,['iyear', 'nperps']]\
                                        .astype('int16', copy=False)
        
        X.loc[:,X.select_dtypes(object).columns.values] = X.loc[:, X.select_dtypes(object).columns.values]\
                                                           .astype('category', copy=False)
            
        #print("\nProcPerf dtypeconv END")
        #showXy(X) #y
        #print(X.dtypes)

        return X   

    def cattobin(self, X, y):
        #print("\nProcPerf cattobin START")
        #showXy(X, y)
               
        X = pd.get_dummies(X)
        X = csr_matrix(X)

        label_encoder = LabelEncoder()
        y = label_encoder.fit_transform(y)
        y = y.astype('int16')
        
        #print("\nProcPerf cattobin END")
        #showXy(X, y)
        
        return X, y
    
    def transform(self, X, y):
        print("\nProcPerf transform START")
        #showXy(X) #y
        #print("self.y.shape: {}".format(self.y.shape))
        
        y = y
        
        #X_naf = naff(X, crits)
        X = self.dtypeconv(X)
        X, y = self.cattobin(X, y)
        
        print("ProcPerf transform END")
        #showXy(X) #y
        #print("self.y.shape: {}".format(self.y.shape))
                
        return X, y

In [139]:
class GTDFilter(TransformerMixin, BaseEstimator):
    '''
    TransformerMixing; gives the `fit_transform` method (beyond `fit` and `transform`)
    BaseEstimator: provides parameters usable for Grid search
    '''
    
    def __init__(self, startdate, enddate, country, columns, onlyknown=False,  min_inc=1):
        self.sdate = startdate
        self.edate = enddate
        self.country = country
        self.cols = columns
        self.ok = onlyknown
        self.minc = min_inc
        
        #print("\nGTDFilter __init__")
        #showXy(X, y)
    
    def fit(self, X, y):
        self.y = y
    
        print("\nGTDFilter fit")
        showXy(X, y)
        print("self.y.shape: {}".format(self.y.shape))
        
        return self
        
    def transform(self, X, y):
        print("\nGTDFilter transform START")
        showXy(X, y)
        print("self.y.shape: {}".format(self.y.shape))
        
        #self.y_work = self.y
        
        if self.ok:
            #nukcrit = self.y_work != 'Unknown'
            nukcrit = y != 'Unknown'
            X = X[nukcrit]
            y = y[nukcrit]
            # self.y_work = self.y_work[nukcrit]
                           
        filcrit = (X.iyear >= max(self.sdate, X.iyear.min())) \
                  & (X.iyear <= min(self.edate, X.iyear.max())) \
                  & (X.country_txt == 'India')
        
        X =  X[filcrit].loc[:,self.cols]
        #self.y_work = self.y_work[filcrit]
        y = y[filcrit]
        
        micrit = y.isin(y.value_counts() \
                        [y.value_counts() >= self.minc] \
                        .index.values)
        
        X = X[micrit]
        y = y[micrit]
        # self.y_work = self.y_work[micrit]
        
        print("GTDFilter transform END")
        showXy(X, y)
        print("self.y.shape: {}".format(self.y.shape))
                
        return X, y #self.y_work

In [140]:
class TransWrap(TransformerMixin, BaseEstimator):
    def __init__(self, transformer, ytrans=False):
        #print("\nTransWrap __init__")
        #showXy(X, y)
        
        self.trans = transformer
        self.ytrans = ytrans
        self.actions = ('transform', 'sample', 'predict', 'score', 'params')
        
    def fit(self, X, y):
        print("\nTransWrap fit START")
        showXy(X, y)
       
        
        self.y = y
        print('self.y.shape: {}'.format(self.y.shape))
    
        if isinstance(X, tuple):
            X, self.y = X[0], X[1]
            #print(X.isna().any().any())
            #print(X)
            self.trans.fit(X, self.y)
        else:
            #print(X.isna().any().any())
            self.trans.fit(X, y)
        
        return self

    def predict(self, X, y=None):
        y = self.y
        
        if isinstance(X, tuple):
            X, self.y = X[0], X[1]
        
        return self.trans.predict(X)
    
    def predict_proba(self, X):
        y = self.y
        return self.trans.predict_proba(X)
    
    def transform(self, X, y=None):
        print("\nTransWrap transform START")
        showXy(X) #y
        print('self.y.shape: {}'.format(self.y.shape))
        
        y = self.y
        
        if isinstance(X, tuple):
            X, y = X[0], X[1]
            #self.y = y
        
        def select_action(actions, l):
            if len(actions) > 0:
                if actions[0] in dir(l):
                    print("\n{}\n\tAction: {}".format(l, actions[0]))
                    return actions[0]
                return select_action(actions[1:], l)
            else:
                print("Transformer action is not within defined options.")
                return None
        
        action = select_action(self.actions, self.trans)
               
        if self.ytrans:
            showXy(X, y)
            # print(y)
            new_X, new_y = getattr(self.trans, action)(X, y)
            #if action == 'sample':
            #    new_X, new_y = getattr(self.trans, action)(X, y)
            #else:
            #    new_X, new_y = getattr(self.trans, action)(X)

        else:
            new_X = getattr(self.trans, action)(X)
            new_y = y
            
        print("TransWrap transform END")
        print("new_X.shape: {}".format(new_X.shape))
        print("new_y.shape: {}".format(new_y.shape))

        return new_X, new_y

In [141]:
crits = {'nkill': (62, 124, 'abc'), 
         'nwound': (272, 544, 'def'), 
         'nhostkid': (400, 800, 'ghi')}

# The Pipeline

In [153]:
pipe = Pipeline([
    ('filter', TransWrap(GTDFilter(startdate=1970,
                                           enddate=2015, 
                                           country='India', 
                                           onlyknown=True, 
                                           columns=('iyear', 'imonth', 'iday', 'extended', 'provstate', 'city',
                                                    'attacktype1_txt', 'targtype1_txt',  'nperps', 'weaptype1_txt', 
                                                    'nkill', 'nwound', 'nhostkid'),
                                           min_inc=2),
                                 ytrans=True)),
    ('naffect recoder', TransWrap(CodeMergeNaff(col_bins_codes=crits))),
    ('perfproc', TransWrap(ProcPerf(), ytrans=True)),
    #('smote', TransWrap(SMOTE(ratio='all', k_neighbors=1, n_jobs=-1), ytrans=True)),
    #('DTC', TransWrap(DecisionTreeClassifier())),
])

In [154]:
validation_size = 0.2
seed = 17

In [155]:
X_train, X_validation, y_train, y_validation = train_test_split(gtd.drop(columns='gname'), gtd.gname, test_size=validation_size, random_state=seed)
print(X_train.shape)
print(X_validation.shape)
print(y_train.shape)
print(y_validation.shape)

(13628, 134)
(3407, 134)
(13628,)
(3407,)


In [156]:
%time
# result = pipe.predict(X_validation)
transX, transy = pipe.fit_transform(X_train, y_train)
#print(resX.describe())
#print(resX.describe(include=object))
#gtd.loc[resy.index,:].gname.value_counts()
#gtd.loc[resX.index,:].country_txt.value_counts()
#resy.value_counts()

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 11.4 µs

TransWrap fit START
X.shape: (13628, 134)
y.shape: (13628,)
self.y.shape: (13628,)

GTDFilter fit
X.shape: (13628, 134)
y.shape: (13628,)
self.y.shape: (13628,)

TransWrap transform START
X.shape: (13628, 134)
self.y.shape: (13628,)

GTDFilter(columns=None, country='India', enddate=None, min_inc=None,
     onlyknown=None, startdate=None)
	Action: transform
X.shape: (13628, 134)
y.shape: (13628,)

GTDFilter transform START
X.shape: (13628, 134)
y.shape: (13628,)
self.y.shape: (13628,)
GTDFilter transform END
X.shape: (488, 13)
y.shape: (488,)
self.y.shape: (13628,)
TransWrap transform END
new_X.shape: (488, 13)
new_y.shape: (488,)

TransWrap fit START
X[0].shape: (488, 13)
X[1].shape: (488,)
y.shape: (13628,)
self.y.shape: (13628,)

TransWrap transform START
X[0].shape: (488, 13)
X[1].shape: (488,)
self.y.shape: (488,)

CodeMergeNaff(col_bins_codes=None)
	Action: transform

CodeMergeNaff transform START
CodeMergeNaff trans

In [158]:
resX

(array([5, 5, 4, ..., 1, 1, 1], dtype=int16),
 array([5, 5, 4, ..., 1, 1, 1], dtype=int16))

In [170]:
resy.shape

(6394,)

In [169]:
resX[0].shape

(8450,)

In [167]:
(resX[1] == resX[0]).all()

True

In [168]:
model = DecisionTreeClassifier().fit(resX[0], resy)

ValueError: Expected 2D array, got 1D array instead:
array=[5. 5. 4. ... 1. 1. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [147]:
scoring='accuracy'
cv=3
n_jobs=-1
# max_features = 2500
X = gtd.drop(columns='gname')
y = gtd.gname

In [148]:
cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)


TransWrap fit START
X.shape: (11356, 134)
y.shape: (11356,)
self.y.shape: (11356,)

TransWrap fit START
X.shape: (11357, 134)
y.shape: (11357,)
self.y.shape: (11357,)


JoblibAttributeError: JoblibAttributeError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/home/andras/anaconda3/lib/python3.6/runpy.py in _run_module_as_main(mod_name='ipykernel_launcher', alter_argv=1)
    188         sys.exit(msg)
    189     main_globals = sys.modules["__main__"].__dict__
    190     if alter_argv:
    191         sys.argv[0] = mod_spec.origin
    192     return _run_code(code, main_globals, None,
--> 193                      "__main__", mod_spec)
        mod_spec = ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py')
    194 
    195 def run_module(mod_name, init_globals=None,
    196                run_name=None, alter_sys=False):
    197     """Execute a module's code without importing it

...........................................................................
/home/andras/anaconda3/lib/python3.6/runpy.py in _run_code(code=<code object <module> at 0x7f330fd368a0, file "/...3.6/site-packages/ipykernel_launcher.py", line 5>, run_globals={'__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__cached__': '/home/andras/anaconda3/lib/python3.6/site-packages/__pycache__/ipykernel_launcher.cpython-36.pyc', '__doc__': 'Entry point for launching an IPython kernel.\n\nTh...orts until\nafter removing the cwd from sys.path.\n', '__file__': '/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py', '__loader__': <_frozen_importlib_external.SourceFileLoader object>, '__name__': '__main__', '__package__': '', '__spec__': ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py'), 'app': <module 'ipykernel.kernelapp' from '/home/andras.../python3.6/site-packages/ipykernel/kernelapp.py'>, ...}, init_globals=None, mod_name='__main__', mod_spec=ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py'), pkg_name='', script_name=None)
     80                        __cached__ = cached,
     81                        __doc__ = None,
     82                        __loader__ = loader,
     83                        __package__ = pkg_name,
     84                        __spec__ = mod_spec)
---> 85     exec(code, run_globals)
        code = <code object <module> at 0x7f330fd368a0, file "/...3.6/site-packages/ipykernel_launcher.py", line 5>
        run_globals = {'__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__cached__': '/home/andras/anaconda3/lib/python3.6/site-packages/__pycache__/ipykernel_launcher.cpython-36.pyc', '__doc__': 'Entry point for launching an IPython kernel.\n\nTh...orts until\nafter removing the cwd from sys.path.\n', '__file__': '/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py', '__loader__': <_frozen_importlib_external.SourceFileLoader object>, '__name__': '__main__', '__package__': '', '__spec__': ModuleSpec(name='ipykernel_launcher', loader=<_f...b/python3.6/site-packages/ipykernel_launcher.py'), 'app': <module 'ipykernel.kernelapp' from '/home/andras.../python3.6/site-packages/ipykernel/kernelapp.py'>, ...}
     86     return run_globals
     87 
     88 def _run_module_code(code, init_globals=None,
     89                     mod_name=None, mod_spec=None,

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py in <module>()
     11     # This is added back by InteractiveShellApp.init_path()
     12     if sys.path[0] == '':
     13         del sys.path[0]
     14 
     15     from ipykernel import kernelapp as app
---> 16     app.launch_new_instance()

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    653 
    654         If a global instance already exists, this reinitializes and starts it
    655         """
    656         app = cls.instance(**kwargs)
    657         app.initialize(argv)
--> 658         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    659 
    660 #-----------------------------------------------------------------------------
    661 # utility functions, for convenience
    662 #-----------------------------------------------------------------------------

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    481         if self.poller is not None:
    482             self.poller.start()
    483         self.kernel.start()
    484         self.io_loop = ioloop.IOLoop.current()
    485         try:
--> 486             self.io_loop.start()
        self.io_loop.start = <bound method BaseAsyncIOLoop.start of <tornado.platform.asyncio.AsyncIOMainLoop object>>
    487         except KeyboardInterrupt:
    488             pass
    489 
    490 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py in start(self=<tornado.platform.asyncio.AsyncIOMainLoop object>)
    122         except (RuntimeError, AssertionError):
    123             old_loop = None
    124         try:
    125             self._setup_logging()
    126             asyncio.set_event_loop(self.asyncio_loop)
--> 127             self.asyncio_loop.run_forever()
        self.asyncio_loop.run_forever = <bound method BaseEventLoop.run_forever of <_Uni...EventLoop running=True closed=False debug=False>>
    128         finally:
    129             asyncio.set_event_loop(old_loop)
    130 
    131     def stop(self):

...........................................................................
/home/andras/anaconda3/lib/python3.6/asyncio/base_events.py in run_forever(self=<_UnixSelectorEventLoop running=True closed=False debug=False>)
    417             sys.set_asyncgen_hooks(firstiter=self._asyncgen_firstiter_hook,
    418                                    finalizer=self._asyncgen_finalizer_hook)
    419         try:
    420             events._set_running_loop(self)
    421             while True:
--> 422                 self._run_once()
        self._run_once = <bound method BaseEventLoop._run_once of <_UnixS...EventLoop running=True closed=False debug=False>>
    423                 if self._stopping:
    424                     break
    425         finally:
    426             self._stopping = False

...........................................................................
/home/andras/anaconda3/lib/python3.6/asyncio/base_events.py in _run_once(self=<_UnixSelectorEventLoop running=True closed=False debug=False>)
   1429                         logger.warning('Executing %s took %.3f seconds',
   1430                                        _format_handle(handle), dt)
   1431                 finally:
   1432                     self._current_handle = None
   1433             else:
-> 1434                 handle._run()
        handle._run = <bound method Handle._run of <Handle BaseAsyncIOLoop._handle_events(11, 1)>>
   1435         handle = None  # Needed to break cycles when an exception occurs.
   1436 
   1437     def _set_coroutine_wrapper(self, enabled):
   1438         try:

...........................................................................
/home/andras/anaconda3/lib/python3.6/asyncio/events.py in _run(self=<Handle BaseAsyncIOLoop._handle_events(11, 1)>)
    140             self._callback = None
    141             self._args = None
    142 
    143     def _run(self):
    144         try:
--> 145             self._callback(*self._args)
        self._callback = <bound method BaseAsyncIOLoop._handle_events of <tornado.platform.asyncio.AsyncIOMainLoop object>>
        self._args = (11, 1)
    146         except Exception as exc:
    147             cb = _format_callback_source(self._callback, self._args)
    148             msg = 'Exception in callback {}'.format(cb)
    149             context = {

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py in _handle_events(self=<tornado.platform.asyncio.AsyncIOMainLoop object>, fd=11, events=1)
    112             self.writers.remove(fd)
    113         del self.handlers[fd]
    114 
    115     def _handle_events(self, fd, events):
    116         fileobj, handler_func = self.handlers[fd]
--> 117         handler_func(fileobj, events)
        handler_func = <function wrap.<locals>.null_wrapper>
        fileobj = <zmq.sugar.socket.Socket object>
        events = 1
    118 
    119     def start(self):
    120         try:
    121             old_loop = asyncio.get_event_loop()

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    271         # Fast path when there are no active contexts.
    272         def null_wrapper(*args, **kwargs):
    273             try:
    274                 current_state = _state.contexts
    275                 _state.contexts = cap_contexts[0]
--> 276                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    277             finally:
    278                 _state.contexts = current_state
    279         null_wrapper._wrapped = True
    280         return null_wrapper

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    445             return
    446         zmq_events = self.socket.EVENTS
    447         try:
    448             # dispatch events:
    449             if zmq_events & zmq.POLLIN and self.receiving():
--> 450                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    451                 if not self.socket:
    452                     return
    453             if zmq_events & zmq.POLLOUT and self.sending():
    454                 self._handle_send()

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    475             else:
    476                 raise
    477         else:
    478             if self._recv_callback:
    479                 callback = self._recv_callback
--> 480                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function wrap.<locals>.null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    481         
    482 
    483     def _handle_send(self):
    484         """Handle a send event."""

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function wrap.<locals>.null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    427         close our socket."""
    428         try:
    429             # Use a NullContext to ensure that all StackContexts are run
    430             # inside our blanket exception handler rather than outside.
    431             with stack_context.NullContext():
--> 432                 callback(*args, **kwargs)
        callback = <function wrap.<locals>.null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    433         except:
    434             gen_log.error("Uncaught exception in ZMQStream callback",
    435                           exc_info=True)
    436             # Re-raise the exception so that IOLoop.handle_callback_exception

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    271         # Fast path when there are no active contexts.
    272         def null_wrapper(*args, **kwargs):
    273             try:
    274                 current_state = _state.contexts
    275                 _state.contexts = cap_contexts[0]
--> 276                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    277             finally:
    278                 _state.contexts = current_state
    279         null_wrapper._wrapped = True
    280         return null_wrapper

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    278         if self.control_stream:
    279             self.control_stream.on_recv(self.dispatch_control, copy=False)
    280 
    281         def make_dispatcher(stream):
    282             def dispatcher(msg):
--> 283                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    284             return dispatcher
    285 
    286         for s in self.shell_streams:
    287             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {'allow_stdin': True, 'code': 'cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)', 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 8, 17, 11, 37, 31, 11184, tzinfo=tzutc()), 'msg_id': '73c896325b4dad10eee143b0bdb9f902', 'msg_type': 'execute_request', 'session': 'c783d31e4786cafb6480ce08b99df8a3', 'username': '', 'version': '5.2'}, 'metadata': {}, 'msg_id': '73c896325b4dad10eee143b0bdb9f902', 'msg_type': 'execute_request', 'parent_header': {}})
    228             self.log.warn("Unknown message type: %r", msg_type)
    229         else:
    230             self.log.debug("%s: %s", msg_type, msg)
    231             self.pre_handler_hook()
    232             try:
--> 233                 handler(stream, idents, msg)
        handler = <bound method Kernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = [b'c783d31e4786cafb6480ce08b99df8a3']
        msg = {'buffers': [], 'content': {'allow_stdin': True, 'code': 'cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)', 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 8, 17, 11, 37, 31, 11184, tzinfo=tzutc()), 'msg_id': '73c896325b4dad10eee143b0bdb9f902', 'msg_type': 'execute_request', 'session': 'c783d31e4786cafb6480ce08b99df8a3', 'username': '', 'version': '5.2'}, 'metadata': {}, 'msg_id': '73c896325b4dad10eee143b0bdb9f902', 'msg_type': 'execute_request', 'parent_header': {}}
    234             except Exception:
    235                 self.log.error("Exception in message handler:", exc_info=True)
    236             finally:
    237                 self.post_handler_hook()

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=[b'c783d31e4786cafb6480ce08b99df8a3'], parent={'buffers': [], 'content': {'allow_stdin': True, 'code': 'cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)', 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 8, 17, 11, 37, 31, 11184, tzinfo=tzutc()), 'msg_id': '73c896325b4dad10eee143b0bdb9f902', 'msg_type': 'execute_request', 'session': 'c783d31e4786cafb6480ce08b99df8a3', 'username': '', 'version': '5.2'}, 'metadata': {}, 'msg_id': '73c896325b4dad10eee143b0bdb9f902', 'msg_type': 'execute_request', 'parent_header': {}})
    394         if not silent:
    395             self.execution_count += 1
    396             self._publish_execute_input(code, parent, self.execution_count)
    397 
    398         reply_content = self.do_execute(code, silent, store_history,
--> 399                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    400 
    401         # Flush output before sending the reply.
    402         sys.stdout.flush()
    403         sys.stderr.flush()

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code='cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)', silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    203 
    204         self._forward_input(allow_stdin)
    205 
    206         reply_content = {}
    207         try:
--> 208             res = shell.run_cell(code, store_history=store_history, silent=silent)
        res = undefined
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = 'cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)'
        store_history = True
        silent = False
    209         finally:
    210             self._restore_input()
    211 
    212         if res.error_before_exec is not None:

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, *args=('cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)',), **kwargs={'silent': False, 'store_history': True})
    532             )
    533         self.payload_manager.write_payload(payload)
    534 
    535     def run_cell(self, *args, **kwargs):
    536         self._last_traceback = None
--> 537         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
        self.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        args = ('cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)',)
        kwargs = {'silent': False, 'store_history': True}
    538 
    539     def _showtraceback(self, etype, evalue, stb):
    540         # try to preserve ordering of tracebacks and print statements
    541         sys.stdout.flush()

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell='cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)', store_history=True, silent=False, shell_futures=True)
   2657         -------
   2658         result : :class:`ExecutionResult`
   2659         """
   2660         try:
   2661             result = self._run_cell(
-> 2662                 raw_cell, store_history, silent, shell_futures)
        raw_cell = 'cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)'
        store_history = True
        silent = False
        shell_futures = True
   2663         finally:
   2664             self.events.trigger('post_execute')
   2665             if not silent:
   2666                 self.events.trigger('post_run_cell', result)

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py in _run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell='cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)', store_history=True, silent=False, shell_futures=True)
   2780                 self.displayhook.exec_result = result
   2781 
   2782                 # Execute the user code
   2783                 interactivity = 'none' if silent else self.ast_node_interactivity
   2784                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2785                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler object>
   2786                 
   2787                 self.last_execution_succeeded = not has_raised
   2788                 self.last_execution_result = result
   2789 

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Expr object>], cell_name='<ipython-input-148-1acc040c3d6c>', interactivity='last', compiler=<IPython.core.compilerop.CachingCompiler object>, result=<ExecutionResult object at 7f32c78fe940, executi...rue silent=False shell_futures=True> result=None>)
   2904                     return True
   2905 
   2906             for i, node in enumerate(to_run_interactive):
   2907                 mod = ast.Interactive([node])
   2908                 code = compiler(mod, cell_name, "single")
-> 2909                 if self.run_code(code, result):
        self.run_code = <bound method InteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x7f32c73dd270, file "<ipython-input-148-1acc040c3d6c>", line 1>
        result = <ExecutionResult object at 7f32c78fe940, executi...rue silent=False shell_futures=True> result=None>
   2910                     return True
   2911 
   2912             # Flush softspace
   2913             if softspace(sys.stdout, 0):

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x7f32c73dd270, file "<ipython-input-148-1acc040c3d6c>", line 1>, result=<ExecutionResult object at 7f32c78fe940, executi...rue silent=False shell_futures=True> result=None>)
   2958         outflag = True  # happens in more places, so it's easier as default
   2959         try:
   2960             try:
   2961                 self.hooks.pre_run_code_hook()
   2962                 #rprint('Running code', repr(code_obj)) # dbg
-> 2963                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x7f32c73dd270, file "<ipython-input-148-1acc040c3d6c>", line 1>
        self.user_global_ns = {'BaseEstimator': <class 'sklearn.base.BaseEstimator'>, 'CodeMergeNaff': <class '__main__.CodeMergeNaff'>, 'DecisionTreeClassifier': <class 'sklearn.tree.tree.DecisionTreeClassifier'>, 'GTDFilter': <class '__main__.GTDFilter'>, 'GaussianNB': <class 'sklearn.naive_bayes.GaussianNB'>, 'In': ['', "get_ipython().run_line_magic('clear', '')", 'import pandas as pd\nimport numpy as np', 'from sklearn.preprocessing import LabelEncoder\nf... import SMOTE\nfrom scipy.sparse import csr_matrix', 'from sklearn.model_selection import train_test_s...validate, StratifiedKFold, StratifiedShuffleSplit', 'from sklearn.tree import DecisionTreeClassifier\n...om sklearn.linear_model import LogisticRegression', 'from sklearn.metrics import accuracy_score, precision_score, classification_report, roc_auc_score', "get_ipython().run_line_magic('env', 'JOBLIB_TEMP_FOLDER=/tmp')", 'import gc', '# Instead of the excel from their homepage, I us...terrorismdb_0617dist.csv", encoding=\'ISO-8859-1\')', '# In case we want to use a sample\ngtd_ori = gtd\ngtd = gtd.sample(frac=0.1)', "dat = gtd[(gtd.iyear >= 1970) \n    & (gtd.iyear ...', 'nwound', 'nhostkid', 'gname']]    \n\ndat.shape", 'from sklearn.pipeline import Pipeline\nfrom sklearn.base import TransformerMixin , BaseEstimator', "steps_in_val_dict = {\n    'Set location': [1, 1,..._x', 'train_y', 'val_x', 'val_y')).T\nsteps_in_val", 'def showXy(X, y=None):\n        if isinstance(X, ...\n            print("y.shape: {}".format(y.shape))', 'class CodeMergeNaff(TransformerMixin, BaseEstima...t("CodeMergeNaff transform END")\n        return X', 'class ProcPerf(TransformerMixin, BaseEstimator):...lf.y.shape))\n                \n        return X, y', 'class GTDFilter(TransformerMixin, BaseEstimator)...                \n        return X, y #self.y_work', 'class TransWrap(TransformerMixin, BaseEstimator)...format(new_y.shape))\n\n        return new_X, new_y', "crits = {'nkill': (62, 124, 'abc'), \n         'n... 'def'), \n         'nhostkid': (400, 800, 'ghi')}", ...], 'KFold': <class 'sklearn.model_selection._split.KFold'>, 'KNeighborsClassifier': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'LabelEncoder': <class 'sklearn.preprocessing.label.LabelEncoder'>, 'LinearDiscriminantAnalysis': <class 'sklearn.discriminant_analysis.LinearDiscriminantAnalysis'>, ...}
        self.user_ns = {'BaseEstimator': <class 'sklearn.base.BaseEstimator'>, 'CodeMergeNaff': <class '__main__.CodeMergeNaff'>, 'DecisionTreeClassifier': <class 'sklearn.tree.tree.DecisionTreeClassifier'>, 'GTDFilter': <class '__main__.GTDFilter'>, 'GaussianNB': <class 'sklearn.naive_bayes.GaussianNB'>, 'In': ['', "get_ipython().run_line_magic('clear', '')", 'import pandas as pd\nimport numpy as np', 'from sklearn.preprocessing import LabelEncoder\nf... import SMOTE\nfrom scipy.sparse import csr_matrix', 'from sklearn.model_selection import train_test_s...validate, StratifiedKFold, StratifiedShuffleSplit', 'from sklearn.tree import DecisionTreeClassifier\n...om sklearn.linear_model import LogisticRegression', 'from sklearn.metrics import accuracy_score, precision_score, classification_report, roc_auc_score', "get_ipython().run_line_magic('env', 'JOBLIB_TEMP_FOLDER=/tmp')", 'import gc', '# Instead of the excel from their homepage, I us...terrorismdb_0617dist.csv", encoding=\'ISO-8859-1\')', '# In case we want to use a sample\ngtd_ori = gtd\ngtd = gtd.sample(frac=0.1)', "dat = gtd[(gtd.iyear >= 1970) \n    & (gtd.iyear ...', 'nwound', 'nhostkid', 'gname']]    \n\ndat.shape", 'from sklearn.pipeline import Pipeline\nfrom sklearn.base import TransformerMixin , BaseEstimator', "steps_in_val_dict = {\n    'Set location': [1, 1,..._x', 'train_y', 'val_x', 'val_y')).T\nsteps_in_val", 'def showXy(X, y=None):\n        if isinstance(X, ...\n            print("y.shape: {}".format(y.shape))', 'class CodeMergeNaff(TransformerMixin, BaseEstima...t("CodeMergeNaff transform END")\n        return X', 'class ProcPerf(TransformerMixin, BaseEstimator):...lf.y.shape))\n                \n        return X, y', 'class GTDFilter(TransformerMixin, BaseEstimator)...                \n        return X, y #self.y_work', 'class TransWrap(TransformerMixin, BaseEstimator)...format(new_y.shape))\n\n        return new_X, new_y', "crits = {'nkill': (62, 124, 'abc'), \n         'n... 'def'), \n         'nhostkid': (400, 800, 'ghi')}", ...], 'KFold': <class 'sklearn.model_selection._split.KFold'>, 'KNeighborsClassifier': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'LabelEncoder': <class 'sklearn.preprocessing.label.LabelEncoder'>, 'LinearDiscriminantAnalysis': <class 'sklearn.discriminant_analysis.LinearDiscriminantAnalysis'>, ...}
   2964             finally:
   2965                 # Reset our crash handler in place
   2966                 sys.excepthook = old_excepthook
   2967         except SystemExit as e:

...........................................................................
/home/andras/Projects/fs-ai/notebook/<ipython-input-148-1acc040c3d6c> in <module>()
----> 1 cross_val_score(estimator=pipe, X=X, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator=Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]), X=             eventid  iyear  imonth  iday       ...                NaN  

[17035 rows x 134 columns], y=18319                    Nicaraguan Democratic F...o Haram
Name: gname, Length: 17035, dtype: object, groups=None, scoring='accuracy', cv=3, n_jobs=-1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')
    337     cv_results = cross_validate(estimator=estimator, X=X, y=y, groups=groups,
    338                                 scoring={'score': scorer}, cv=cv,
    339                                 return_train_score=False,
    340                                 n_jobs=n_jobs, verbose=verbose,
    341                                 fit_params=fit_params,
--> 342                                 pre_dispatch=pre_dispatch)
        pre_dispatch = '2*n_jobs'
    343     return cv_results['test_score']
    344 
    345 
    346 def _fit_and_score(estimator, X, y, scorer, train, test, verbose,

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator=Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]), X=             eventid  iyear  imonth  iday       ...                NaN  

[17035 rows x 134 columns], y=18319                    Nicaraguan Democratic F...o Haram
Name: gname, Length: 17035, dtype: object, groups=None, scoring={'score': make_scorer(accuracy_score)}, cv=KFold(n_splits=3, random_state=None, shuffle=False), n_jobs=-1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=False)
    201     scores = parallel(
    202         delayed(_fit_and_score)(
    203             clone(estimator), X, y, scorers, train, test, verbose, None,
    204             fit_params, return_train_score=return_train_score,
    205             return_times=True)
--> 206         for train, test in cv.split(X, y, groups))
        cv.split = <bound method _BaseKFold.split of KFold(n_splits=3, random_state=None, shuffle=False)>
        X =              eventid  iyear  imonth  iday       ...                NaN  

[17035 rows x 134 columns]
        y = 18319                    Nicaraguan Democratic F...o Haram
Name: gname, Length: 17035, dtype: object
        groups = None
    207 
    208     if return_train_score:
    209         train_scores, test_scores, fit_times, score_times = zip(*scores)
    210         train_scores = _aggregate_score_dicts(train_scores)

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object cross_validate.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
AttributeError                                     Fri Aug 17 13:37:31 2018
PID: 21903                  Python 3.6.6: /home/andras/anaconda3/bin/python
...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_and_score>, (Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]),              eventid  iyear  imonth  iday       ...                NaN  

[17035 rows x 134 columns], 18319                    Nicaraguan Democratic F...o Haram
Name: gname, Length: 17035, dtype: object, {'score': make_scorer(accuracy_score)}, array([ 5679,  5680,  5681, ..., 17032, 17033, 17034]), array([   0,    1,    2, ..., 5676, 5677, 5678]), 0, None, None), {'return_times': True, 'return_train_score': False})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]),              eventid  iyear  imonth  iday       ...                NaN  

[17035 rows x 134 columns], 18319                    Nicaraguan Democratic F...o Haram
Name: gname, Length: 17035, dtype: object, {'score': make_scorer(accuracy_score)}, array([ 5679,  5680,  5681, ..., 17032, 17033, 17034]), array([   0,    1,    2, ..., 5676, 5677, 5678]), 0, None, None)
        kwargs = {'return_times': True, 'return_train_score': False}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]), X=             eventid  iyear  imonth  iday       ...                NaN  

[17035 rows x 134 columns], y=18319                    Nicaraguan Democratic F...o Haram
Name: gname, Length: 17035, dtype: object, scorer={'score': make_scorer(accuracy_score)}, train=array([ 5679,  5680,  5681, ..., 17032, 17033, 17034]), test=array([   0,    1,    2, ..., 5676, 5677, 5678]), verbose=0, parameters=None, fit_params={}, return_train_score=False, return_parameters=False, return_n_test_samples=False, return_times=True, error_score='raise')
    453 
    454     try:
    455         if y_train is None:
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method Pipeline.fit of Pipeline(memory=No...C', TransWrap(transformer=None, ytrans=False))])>
        X_train =              eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns]
        y_train = 61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object
        fit_params = {}
    459 
    460     except Exception as e:
    461         # Note fit time as time until error
    462         fit_time = time.time() - start_time

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self=Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]), X=             eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], y=61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object, **fit_params={})
    243         Returns
    244         -------
    245         self : Pipeline
    246             This estimator
    247         """
--> 248         Xt, fit_params = self._fit(X, y, **fit_params)
        Xt = undefined
        fit_params = {}
        self._fit = <bound method Pipeline._fit of Pipeline(memory=N...C', TransWrap(transformer=None, ytrans=False))])>
        X =              eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns]
        y = 61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object
    249         if self._final_estimator is not None:
    250             self._final_estimator.fit(Xt, y, **fit_params)
    251         return self
    252 

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self=Pipeline(memory=None,
     steps=[('filter', Tra...TC', TransWrap(transformer=None, ytrans=False))]), X=             eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], y=61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object, **fit_params={})
    208                 else:
    209                     cloned_transformer = clone(transformer)
    210                 # Fit or load from cache the current transfomer
    211                 Xt, fitted_transformer = fit_transform_one_cached(
    212                     cloned_transformer, None, Xt, y,
--> 213                     **fit_params_steps[name])
        fit_params_steps = {'DTC': {}, 'filter': {}, 'naffect recoder': {}, 'perfproc': {}}
        name = 'filter'
    214                 # Replace the transformer of the step with the fitted
    215                 # transformer. This is necessary when loading the transformer
    216                 # from the cache.
    217                 self.steps[step_idx] = (name, fitted_transformer)

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py in __call__(self=NotMemorizedFunc(func=<function _fit_transform_one at 0x7f32ba29a7b8>), *args=(TransWrap(transformer=None, ytrans=True), None,              eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], 61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object), **kwargs={})
    357     # Should be a light as possible (for speed)
    358     def __init__(self, func):
    359         self.func = func
    360 
    361     def __call__(self, *args, **kwargs):
--> 362         return self.func(*args, **kwargs)
        self.func = <function _fit_transform_one>
        args = (TransWrap(transformer=None, ytrans=True), None,              eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], 61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object)
        kwargs = {}
    363 
    364     def call_and_shelve(self, *args, **kwargs):
    365         return NotMemorizedResult(self.func(*args, **kwargs))
    366 

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer=TransWrap(transformer=None, ytrans=True), weight=None, X=             eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], y=61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object, **fit_params={})
    576 
    577 
    578 def _fit_transform_one(transformer, weight, X, y,
    579                        **fit_params):
    580     if hasattr(transformer, 'fit_transform'):
--> 581         res = transformer.fit_transform(X, y, **fit_params)
        res = undefined
        transformer.fit_transform = <bound method TransformerMixin.fit_transform of TransWrap(transformer=None, ytrans=True)>
        X =              eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns]
        y = 61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object
        fit_params = {}
    582     else:
    583         res = transformer.fit(X, y, **fit_params).transform(X)
    584     # if we have a weight for this transformer, multiply output
    585     if weight is None:

...........................................................................
/home/andras/anaconda3/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self=TransWrap(transformer=None, ytrans=True), X=             eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], y=61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object, **fit_params={})
    515         if y is None:
    516             # fit method of arity 1 (unsupervised transformation)
    517             return self.fit(X, **fit_params).transform(X)
    518         else:
    519             # fit method of arity 2 (supervised transformation)
--> 520             return self.fit(X, y, **fit_params).transform(X)
        self.fit = <bound method TransWrap.fit of TransWrap(transformer=None, ytrans=True)>
        X =              eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns]
        y = 61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object
        fit_params.transform = undefined
    521 
    522 
    523 class DensityMixin(object):
    524     """Mixin class for all density estimators in scikit-learn."""

...........................................................................
/home/andras/Projects/fs-ai/notebook/<ipython-input-140-a11be80f2199> in fit(self=TransWrap(transformer=None, ytrans=True), X=             eventid  iyear  imonth  iday       ...                NaN  

[11356 rows x 134 columns], y=61862                                           ...o Haram
Name: gname, Length: 11356, dtype: object)
     20             #print(X.isna().any().any())
     21             #print(X)
     22             self.trans.fit(X, self.y)
     23         else:
     24             #print(X.isna().any().any())
---> 25             self.trans.fit(X, y)
     26         
     27         return self
     28 
     29     def predict(self, X, y=None):

AttributeError: 'NoneType' object has no attribute 'fit'
___________________________________________________________________________

#  Manual solution

In the subset there is a slightly different amount of the unique perpetrator groups from what the authors reported (270) (probably due to update in the database?).

In [None]:
gtd.gname.nunique()

The authors removed those groups which were linked to only one incident. 

In [None]:
dat = dat[dat.gname.isin(dat.gname.value_counts()[dat.gname.value_counts() > 1].index.values)]
dat.shape

In [None]:
dat = dat.copy()
dat.shape

In [None]:
dat.loc[:, ['nkill', 'nwound', 'nhostkid']].describe()

In [None]:
crits = {'nkill': (62, 124, 'abc'), 
         'nwound': (272, 544, 'def'), 
         'nhostkid': (400, 800, 'ghi')}

In [None]:
n = pd.DataFrame(crits)

In [None]:
crits

In [None]:
n

In [None]:
def naff(data, crits):
    n = pd.Series('_')
    for key, i in zip(crits.keys(), range(len(crits))):
        i = data.loc[:, key].copy()

        i[data.loc[:,key] == 0] = 'n'
        i[(data.loc[:,key] > 0) 
          & (data.loc[:,key] < crits[key][0])] = crits[key][2][2]
        i[(data.loc[:,key] <= crits[key][1]) 
          & (data.loc[:,key] >= crits[key][0])] = crits[key][2][1]
        i[data.loc[:,key] > crits[key][1]] = crits[key][2][0]

        n = pd.concat((n, i), axis=1) 

    return n.drop(columns=0)

In [None]:
naffect = naff(dat, crits)
naffect.nhostkid.value_counts(dropna=False)

In [None]:
naf_cri = pd.DataFrame(crits)

def ref_naf(data, crits):   
    s = pd.DataFrame(data.loc[:, crits.columns.values], copy=True)
    print(s.head())

    s[data == 0] = 'n'
    s[(data > 0) & (data < crits.iloc[0,:])] = crits.iloc[2,:].str[2]
    s[(data <= crits.iloc[1,:]) & (data >= crits.iloc[0,:])] = crits.iloc[2,:].str[1]
    s[data > crits.iloc[1,:]] = crits.iloc[2,:].str[0]

    return s

In [None]:
naf_cri.iloc[2,:].str[2]

In [None]:
naffect = ref_naf(dat, naf_cri)
naffect.head(10)

In [None]:
naffect.nhostkid.value_counts()

In [None]:
naffect.nhostkid[naffect.nhostkid == -99] = np.NaN
naffect.replace(np.NaN, 'n', inplace=True)

In [None]:
naffect = naffect.iloc[:,0] +  naffect.iloc[:,1] +  naffect.iloc[:,2]

In [None]:
naffect.value_counts()

In [None]:
dat.drop(columns=['nkill', 'nwound', 'nhostkid'], inplace=True)
dat['naffect'] = naffect

In [None]:
dat.nperps.where(dat.nperps != -99, 0, inplace=True)
dat.nperps.fillna(0, inplace=True)

In [None]:
dat.isna().any().any()

## Defining the column datatypes
We try to gain some performance in the forthcoming operations by defining the column datatypes.

In [None]:
dat.info(memory_usage=True)

In [None]:
dat.describe()

In [None]:
dat.loc[:,['imonth', 'iday', 'extended']] = dat.loc[:,['imonth', 'iday', 'extended']].astype('int8', copy=False)
dat.loc[:,['iyear', 'nperps']] = dat.loc[:,['iyear', 'nperps']].astype('int16', copy=False)
dat.loc[:,dat.select_dtypes(object).columns.values] = dat.loc[:,dat.select_dtypes(object).columns.values].astype('category', copy=False)

In [None]:
dat.dtypes

In [None]:
dat.info(memory_usage=True)

## Transforming the categoricals

In [None]:
dat.gname.value_counts()

In [None]:
X = pd.get_dummies(dat.drop(columns='gname'))
X = csr_matrix(X)
#y = pd.get_dummies(dat.gname)

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(dat.gname)
y = y.astype('int16')

print(X.shape)
print(y.shape)

In [None]:
smote = SMOTE(ratio='all', k_neighbors=1, n_jobs=-1)

In [None]:
#%%time
tXr, tyr = smote.fit_sample(X, y) 
#tXr, tyr = X, y

In [None]:
print(X.shape)
#print(X.nbytes().sum() / 1024)
print(tXr.shape)
# print(tXr.nbytes / 1024)

print(y.shape)
print(y.nbytes / 1024)
print(tyr.shape)
print(tyr.nbytes / 1024)

In [None]:
np.array([np.unique(y, return_counts=True)[0], np.unique(y, return_counts=True)[1]]).T

In [None]:
np.array([np.unique(tyr, return_counts=True)[0], np.unique(tyr, return_counts=True)[1]]).T

# Modeling

In [None]:
validation_size = 0.2
seed = 17

In [None]:
X_train, X_validation, y_train, y_validation = train_test_split(tXr, tyr, test_size=validation_size, random_state=seed)
print(X_train.shape)
print(X_validation.shape)
print(y_train.shape)
print(y_validation.shape)

del tXr
del tyr

gc.collect()

In [None]:
from sklearn.utils.multiclass import type_of_target
type_of_target(y)

In [None]:
kfold = StratifiedKFold(n_splits=10, random_state=seed)
#kfold = KFold(n_splits=10, random_state=seed)

In [None]:
models = {"Decisiong Tree Classifier": DecisionTreeClassifier(),
          # "K-Neighbors Classifier": KNeighborsClassifier(),
          # "Gaussian Naive Bayes": GaussianNB(),
          # "Linear Discriminant Analysis": KNeighborsClassifier(),
          # "Logistic Regression": LogisticRegression(),
         }

In [None]:
def eval_models(models, X, y):
    """Evaluates selected model's prediction power on the cross-validated training datasets.
    Takes
        models: Dictionary of "model_name": model() pairs.
        X: predictor attributes
        y: target attribute
    """
    results = []
    for model in models:
        #print("Running {}...".format(model))
        #start = time.time()

        result = []
        result.append(model)

        model_score = cross_validate(models[model],
                                    X,
                                    y,
                                    scoring=['accuracy', # Evaluation metrics
                                             'precision_micro',
                                             'recall_micro',
                                             'f1_micro',
                                            # 'roc_auc'
                                            ],
                                    cv=kfold, # Cross-validation method
                                    n_jobs=-1,
                                    verbose=0,
                                    return_train_score=False)

        acc_mean = model_score['test_accuracy'].mean()
        acc_std = model_score['test_accuracy'].std()
        #auc_mean = model_score['test_roc_auc'].mean()
        #auc_std = model_score['test_roc_auc'].std()

        print("\n{}:\n\tAccuracy: {} ({})".format(model, acc_mean, acc_std)) #auc_std

        #print("\tROC AUC: {} ({})".format(auc_mean, auc_std))

        precision_micro_mean = model_score['test_precision_micro'].mean()
        precision_micro_std = model_score['test_precision_micro'].std()
        recall_micro_mean = model_score['test_recall_micro'].mean()
        recall_micro_std = model_score['test_recall_micro'].std()

        f1_micro_mean = model_score['test_f1_micro'].mean()
        f1_micro_std = model_score['test_f1_micro'].std()
        print("\tF1 micro: {} ({})".format(f1_micro_mean, f1_micro_std))

        #result = result + [acc_mean, acc_std, auc_mean, auc_std]

        dur = model_score['fit_time'].sum() + model_score['score_time'].sum()

        print("\tduration: {}\n".format(dur))
        #result.append(dur)

        #results.append(result)

In [None]:
% time
eval_models(models, X_train, y_train)

In [None]:
model = DecisionTreeClassifier().fit(X_train, y_train)

In [None]:
preprob = model.predict_proba(X_validation)
preprob.shape

In [None]:
preds = model.predict(X_validation)

In [None]:
accuracy_score(y_validation, preds)

In [None]:
preprob.shape

In [None]:
roc_auc_score(y_validation, preprob)