**Progress summary:**

Tried logistic regression and random forests to predict success. Logistic regression predicted every attack as successful; random forests did predict some failures, but recall for failures was low (0.13 on the holdout set).

Also tried logistic regression, random forests, and SVM for predicting whether an attack results in any fatalities. Accuracy was roughly 74% to 77% for all three (in the dataset, 45% of attacks result in fatalities). Overfitting does not seem to be a problem right now, so adding more explanatory variables might help boost accuracy.

**To do/questions:**

How to improve recall for unsuccessful attacks (how to give greater weight to recall for unsuccessful attacks when tuning)?

How to run SVC faster so it can be tuned properly?

Replace success with fatality rate or another metric? Improving recall here just means pinning down the small proportion (about 10%) of attempted attacks that are not successful. When most attacks in the dataset are successful, identifying failures may not matter that much: in this context, identifying a failure is less important than identifying a success (if it were the other way around, 90% failure and 10% success, improving recall for successes would be very important).


In [109]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('cleaned.csv')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122376 entries, 0 to 122375
Data columns (total 24 columns):
Unnamed: 0     122376 non-null int64
iyear          122376 non-null int64
imonth         122376 non-null int64
region         122376 non-null int64
crit1          122376 non-null int64
crit2          122376 non-null int64
crit3          122376 non-null int64
success        122376 non-null int64
suicide        122376 non-null int64
attacktype1    122376 non-null float64
attacktype2    3533 non-null float64
attacktype3    214 non-null float64
targtype1      122376 non-null float64
targtype2      6685 non-null float64
targtype3      703 non-null float64
individual     122376 non-null int64
weaptype1      122376 non-null float64
weaptype2      8547 non-null float64
weaptype3      1127 non-null float64
weaptype4      63 non-null float64
nkill          122376 non-null float64
nwound         122376 non-null float64
property       122376 non-null float64
propextent     37549 non-null 

In [80]:
explanatory_vars = ['region','crit1','crit2','crit3','suicide','attacktype1','targtype1','individual','weaptype1']

Xtrain, Xtest, ytrain, ytest = train_test_split(df[explanatory_vars].values, df['success'], random_state = 42, test_size = 0.2)

Xtraining, Xholdout, ytraining, yholdout = train_test_split(Xtrain, ytrain, random_state = 42, test_size = 0.2)

print(type(Xtrain), type(ytrain))

<class 'numpy.ndarray'> <class 'pandas.core.series.Series'>


In [81]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

#does cross-validation and computes average score for the 5 folds
def cv_score(clf, x, y, score_func=accuracy_score):
    result = 0
    nfold = 5
    # KFold without shuffling ignores random_state (newer scikit-learn rejects the combination), so it is omitted
    for train, test in KFold(nfold).split(x): # split data into train/test groups, 5 times
        clf.fit(x[train], y.iloc[train]) # fit
        result += score_func(y.iloc[test], clf.predict(x[test])) # score_func(y_true, y_pred) on held-out data
    return result / nfold # average
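
Because only about 10% of attacks are failures, plain `KFold` can leave some folds with noticeably more or fewer failures than others. The sketch below is an illustration on made-up labels (`X_demo`, `y_demo` are hypothetical, not the notebook's data) of how `StratifiedKFold` keeps the class ratio the same in every fold; it is one possible alternative, not what `cv_score` above does.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 90 successes (1) and 10 failures (0)
y_demo = np.array([1] * 90 + [0] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# Stratified splitting needs the labels, so split() takes both X and y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_failure_counts = [
    int(np.sum(y_demo[test_idx] == 0))
    for _, test_idx in skf.split(X_demo, y_demo)
]

# Number of failures that land in each held-out fold
print(fold_failure_counts)
```

To use this inside `cv_score`, the loop would have to call `.split(x, y)` rather than `.split(x)`.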

Using logistic regression to predict success:

In [82]:
Log_clf = LogisticRegression()

#ytrain = ytrain.reset_index()

print(cv_score(Log_clf, Xtrain, ytrain))

0.899305413687


In [73]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Log_clf.fit(Xtraining, ytraining)
print(confusion_matrix(yholdout, Log_clf.predict(Xholdout), labels = [1,0]))
print(classification_report(yholdout, Log_clf.predict(Xholdout)))

[[17591     0]
 [ 1989     0]]
             precision    recall  f1-score   support

          0       0.00      0.00      0.00      1989
          1       0.90      1.00      0.95     17591

avg / total       0.81      0.90      0.85     19580





In [22]:
print(len(Log_clf.predict(Xholdout)), np.sum(Log_clf.predict(Xholdout)))

19580 19580


The logistic regression classifies every attack as a success. Accuracy (0.90) and the F1 score for the success class (0.95) look high, but they mostly reflect the 90% base rate of successful attacks; recall for failures is 0. Now I'll try regularization:

In [70]:
Cs = [0.001, 0.1, 1, 10, 100]

for c in Cs:
    reg_clf = LogisticRegression(C = c)
    reg_clf.fit(Xtraining, ytraining)
    print(c)
    print(confusion_matrix(yholdout, reg_clf.predict(Xholdout), labels = [1,0]))

#for every regularization parameter, still predicts everything as success

0.001
[[17591     0]
 [ 1989     0]]
0.1
[[17591     0]
 [ 1989     0]]
1
[[17591     0]
 [ 1989     0]]
10
[[17591     0]
 [ 1989     0]]
100
[[17591     0]
 [ 1989     0]]


Varying C makes no difference: every attack is still classified as a success. Not sure how to tune logistic regression so that it predicts any negatives.
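
One thing that might be worth trying is the `class_weight` parameter of `LogisticRegression`, which reweights the loss so that errors on the rare class cost more. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption standing in for `cleaned.csv`; all `*_demo` and `X_tr`/`X_te` names are illustrative), so the numbers it prints say nothing about the attack data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 10% label 0 ("failure") and 90% label 1 ("success")
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=9, n_informative=4,
    weights=[0.1, 0.9], class_sep=0.5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Default weighting versus weights inversely proportional to class frequency
plain_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced_clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

# Recall for the minority class (label 0) under each weighting
recall_plain = recall_score(y_te, plain_clf.predict(X_te), pos_label=0)
recall_balanced = recall_score(y_te, balanced_clf.predict(X_te), pos_label=0)
print(recall_plain, recall_balanced)
```

Balanced weights usually raise recall for the rare class at the cost of precision, so both should be checked on the holdout set.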

Next, random forests:

In [106]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score


Rf_clf = RandomForestClassifier(random_state = 42)

Rf_clf.fit(Xtraining, ytraining)

print(confusion_matrix(yholdout, Rf_clf.predict(Xholdout),labels=[1,0]))

print(f1_score(yholdout, Rf_clf.predict(Xholdout)))

[[17427   164]
 [ 1722   267]]
0.948666303756


In [83]:
print(classification_report(yholdout, Rf_clf.predict(Xholdout)))
print(type(classification_report(yholdout, Rf_clf.predict(Xholdout))))

             precision    recall  f1-score   support

          0       0.60      0.13      0.22      1989
          1       0.91      0.99      0.95     17591

avg / total       0.88      0.90      0.87     19580

<class 'str'>


The random forest predicts some failures, but the recall is still very low (0.13) for unsuccessful attacks.
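
A possible way to trade precision for recall without retraining is to lower the probability threshold at which an attack is labeled a failure, using `predict_proba` instead of `predict`. This is a sketch on synthetic data (an assumption; `X_demo`, `p_fail`, and `recall_for_failures` are illustrative names, not part of the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 10% label 0 ("failure") and 90% label 1 ("success")
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=9, n_informative=4,
    weights=[0.1, 0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# predict_proba columns follow rf.classes_, so look up the column for label 0
fail_col = list(rf.classes_).index(0)
p_fail = rf.predict_proba(X_te)[:, fail_col]

def recall_for_failures(threshold):
    """Label a sample 0 when its estimated failure probability reaches the threshold."""
    pred = np.where(p_fail >= threshold, 0, 1)
    return recall_score(y_te, pred, pos_label=0)

for t in (0.5, 0.3, 0.2):
    print(t, recall_for_failures(t))
```

Lowering the threshold can only keep or increase recall for failures, but it also produces more false alarms, so the threshold should be picked on validation data rather than the final test set.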

In [88]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

param_grid = {'max_depth':[3,5,10,20,100], 'min_impurity_decrease':[1e-7,1e-6,1e-5, 1e-4, 1e-3, 1e-2]}
scorer = make_scorer(f1_score)
Rf_clf = RandomForestClassifier()
Rf_clf_cv = GridSearchCV(Rf_clf, param_grid, cv = 5, scoring = scorer)
Rf_clf_cv.fit(Xtrain, ytrain)

print(Rf_clf_cv.best_params_)

{'max_depth': 10, 'min_impurity_decrease': 1e-06}


In [90]:
Rf_clf_tuned = RandomForestClassifier(max_depth = 10, min_impurity_decrease = 1e-6)

Rf_clf_tuned.fit(Xtrain,ytrain)
print(classification_report(ytrain, Rf_clf_tuned.predict(Xtrain)))

             precision    recall  f1-score   support

          0       0.68      0.12      0.20      9858
          1       0.91      0.99      0.95     88042

avg / total       0.89      0.91      0.87     97900



Tuning does not make much of a difference: even on the training data, recall for unsuccessful attacks is only 0.12.
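
Part of the reason may be the scorer: `make_scorer(f1_score)` uses the default `pos_label=1`, so the grid search above optimizes F1 for successful attacks. The sketch below shows one way to point the search at failures instead, using `fbeta_score` with `beta=2` (recall weighted more heavily than precision) and `pos_label=0`. The data is synthetic and the small grid (`demo_grid`) is an assumption chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: roughly 10% label 0 ("failure") and 90% label 1 ("success")
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, n_informative=4,
    weights=[0.1, 0.9], random_state=42)

# Score the minority class (label 0); beta=2 favors recall over precision
failure_scorer = make_scorer(fbeta_score, beta=2, pos_label=0)

# Illustrative grid, deliberately tiny
demo_grid = {'max_depth': [3, 10], 'class_weight': [None, 'balanced']}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    demo_grid, cv=3, scoring=failure_scorer)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Swapping in `make_scorer(recall_score, pos_label=0)` would optimize recall alone, which tends to reward predicting everything as a failure, so an F-beta score is probably the safer starting point.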

Moving to whether an attack is fatal (at least one person killed). The split is closer to 50/50, which makes the metrics easier to interpret:

In [91]:
df['fatal'] = df['nkill'] > 0
print(df['fatal'].sum()/df['fatal'].count())

0.453046348957
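
For context, a base rate like the one above fixes the accuracy of a trivial classifier that always predicts the majority class, which is the number the later accuracy scores should be compared against. A small sketch with simulated labels (an assumption: `y_demo` is drawn at random with about 45.3% `True`, it is not the real `fatal` column):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Simulated boolean labels with roughly the same base rate as df['fatal']
rng = np.random.default_rng(42)
y_demo = rng.random(10000) < 0.453
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant to this baseline

# Always predict the most frequent class (here: not fatal)
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

# Equals 1 minus the share of True labels when True is the minority
print(baseline_acc)
```

With 45.3% fatal attacks the baseline is about 0.55, so an accuracy in the mid 0.70s is a real but moderate improvement.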


In [93]:
Xtrain_f, Xtest_f, ytrain_f, ytest_f = train_test_split(df[explanatory_vars].values, df['fatal'], random_state = 42, test_size = 0.2)

Log_clf_f = LogisticRegression()

print(cv_score(Log_clf_f, Xtrain_f, ytrain_f))

0.738804902962


In [95]:
Log_clf_f = LogisticRegression()
Log_clf_f.fit(Xtrain_f, ytrain_f)
print(classification_report(ytrain_f, Log_clf_f.predict(Xtrain_f)))
print(accuracy_score(ytrain_f, Log_clf_f.predict(Xtrain_f)))

             precision    recall  f1-score   support

      False       0.74      0.81      0.77     53654
       True       0.74      0.65      0.69     44246

avg / total       0.74      0.74      0.74     97900

0.73797752809


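Accuracy on the training data (0.738) is approximately the same as the average cross-validation score (0.739), so there is no sign of overfitting with the default settings. Note that LogisticRegression is regularized by default (L2 penalty with C = 1).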

In [96]:
from sklearn.ensemble import RandomForestClassifier


Rf_clf_f = RandomForestClassifier()

print(cv_score(Rf_clf_f, Xtrain_f, ytrain_f))

0.766884576098


In [97]:
print(classification_report(ytrain_f, Rf_clf_f.predict(Xtrain_f)))
print(accuracy_score(ytrain_f, Rf_clf_f.predict(Xtrain_f)))

             precision    recall  f1-score   support

      False       0.81      0.78      0.79     53654
       True       0.74      0.77      0.76     44246

avg / total       0.78      0.78      0.78     97900

0.775863125638


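The average cross-validated score (0.767) is only slightly worse than the accuracy on the training data (0.776), so there is little overfitting. One caveat: Rf_clf_f was last fitted inside cv_score on four of the five folds, so the 0.776 figure is not purely training accuracy.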
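
The train-versus-validation comparison can also be done in one call with `cross_validate(..., return_train_score=True)`, which reports both scores for each fold using the same fitted model. A minimal sketch on synthetic data (an assumption; `X_demo` and `y_demo` are illustrative and the gap it prints is not the gap on the attack data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in with roughly balanced classes
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, n_informative=4, random_state=42)

cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=5, scoring='accuracy', return_train_score=True)

# A large gap between the two means suggests overfitting
mean_train = cv_results['train_score'].mean()
mean_test = cv_results['test_score'].mean()
print(mean_train, mean_test, mean_train - mean_test)
```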

In [102]:
param_grid = {'max_depth':[3,5,10,20,100], 'min_impurity_decrease':[1e-7,1e-6,1e-5, 1e-4, 1e-3, 1e-2]}
scorer = make_scorer(f1_score)
Rf_clf_f = RandomForestClassifier()
Rf_clf_fcv = GridSearchCV(Rf_clf_f, param_grid, cv = 5, scoring = scorer)
Rf_clf_fcv.fit(Xtrain_f, ytrain_f)

print(Rf_clf_fcv.best_params_)

{'max_depth': 100, 'min_impurity_decrease': 1e-07}


In [104]:
Rf_clf_f_tuned = RandomForestClassifier(max_depth = 100, min_impurity_decrease = 1e-07)

Rf_clf_f_tuned.fit(Xtrain_f,ytrain_f)

print(cv_score(Rf_clf_f_tuned, Xtrain_f, ytrain_f))

0.76759959142


Support vector machines:

In [108]:
from sklearn.svm import SVC

Sv_clf_f = SVC()

Sv_clf_f.fit(Xtrain_f, ytrain_f)

print(accuracy_score(ytrain_f, Sv_clf_f.predict(Xtrain_f)))
#training accuracy is 0.76326. This takes a long time (>10 minutes) to run

0.763258426966


In [None]:
from sklearn.svm import SVC

Sv_clf_f = SVC()

#this takes forever...
print(cv_score(Sv_clf_f, Xtrain_f, ytrain_f))
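
On the speed question: the scikit-learn documentation notes that `SVC` fit time scales at least quadratically with the number of samples, which fits what is seen here with about 98,000 training rows. One option it suggests for large datasets is `LinearSVC`. The sketch below is an illustration on synthetic data (an assumption; `X_demo`, `y_demo`, and `linear_svm` are illustrative names), and a linear kernel is a different model from the default RBF `SVC`, so the accuracy need not match.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with one cluster per class so a linear boundary is reasonable
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=9, n_informative=4,
    n_clusters_per_class=1, random_state=42)

# Scaling inside the pipeline means each CV fold is scaled using only its training part;
# dual=False is the recommended setting when there are more samples than features
linear_svm = make_pipeline(StandardScaler(), LinearSVC(dual=False))

scores = cross_val_score(linear_svm, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

Another possibility, if the RBF kernel is wanted, is to tune on a random subsample of the training rows and refit the chosen parameters once on the full set.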