# 01 GTD Preprocessing

Let's check out `snorkel` using Kaggle's subset of the Global Terrorism Database. To make it more interesting than a basic classification problem, we'll try to identify the perpetrator group name.

We'll preprocess in a few steps to map this back to a binary classifier:

* Extract entities from each sentence using `spacy`
* Create a record for every sentence, for every entity in that sentence:
  * build a network input that annotates the entity:
  
 `I bless the rains down in ENTSTART Africa ENTEND`
 
* Then, in the second notebook, we'll build `snorkel` labeling functions to predict whether the annotated entity is the group responsible for the terrorist attack.
* Finally we'll use the probabilistic labels from `snorkel` to train a neural network
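
The "one record per entity" step can be sketched without `spacy` by locating the entity spans by hand. This is only an illustration of the idea, not the code used below; `annotate` and the hard-coded entity list are hypothetical.

```python
def annotate(sentence, start, end):
    """Wrap sentence[start:end] in ENTSTART / ENTEND markers."""
    return f"{sentence[:start]}ENTSTART {sentence[start:end]} ENTEND{sentence[end:]}"

sentence = "Karl Armstrong threw a firebomb in Madison"
# Assumption: entities are listed by hand here; in the notebook they come from spacy.
entities = ["Karl Armstrong", "Madison"]

records = []
for ent in entities:
    start = sentence.index(ent)
    end = start + len(ent)
    # one record per (sentence, entity) pair
    records.append(annotate(sentence, start, end))

for r in records:
    print(r)
```

Each record is later treated as its own binary example: is the marked entity the responsible group or not?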

**One thing to note:** sometimes (often?) the group name is recorded differently than it appears in the text. But we'll worry about that problem later (never).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import spacy
from tqdm import tqdm

%matplotlib inline
from IPython.core.pylabtools import figsize

## Load the pieces

* load the `spacy` large core English model
* use `pandas` to load the GTD

In [2]:
#nlp = spacy.load("en_core_web_sm")
nlp = spacy.load("en_core_web_lg")

In [3]:
rawfile = "/media/joe/data/gtd/globalterrorismdb_0718dist.csv"
df = pd.read_csv(rawfile, encoding="latin1")
len(df)

181691

In [4]:
# discard anything without text
df = df[pd.notnull(df.summary)]
len(df)

115562

In [5]:
df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
5,197001010002,1970,1,1,,0,,217,United States,1,...,"The Cairo Chief of Police, William Petersen, r...","""Police Chief Quits,"" Washington Post, January...","""Cairo Police Chief Quits; Decries Local 'Mili...","Christopher Hewitt, ""Political Violence and Te...",Hewitt Project,-9,-9,0,-9,
7,197001020002,1970,1,2,,0,,217,United States,1,...,"Damages were estimated to be between $20,000-$...",Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,
8,197001020003,1970,1,2,,0,,217,United States,1,...,The New Years Gang issue a communiqué to a loc...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...","The Wisconsin Cartographers' Guild, ""Wisconsin...",Hewitt Project,0,0,0,0,
9,197001030001,1970,1,3,,0,,217,United States,1,...,"Karl Armstrong's girlfriend, Lynn Schultz, dro...",Committee on Government Operations United Stat...,"Tom Bates, ""Rads: The 1970 Bombing of the Army...","David Newman, Sandra Sutherland, and Jon Stewa...",Hewitt Project,0,0,0,0,
11,197001060001,1970,1,6,,0,,217,United States,1,...,,Committee on Government Operations United Stat...,"Christopher Hewitt, ""Political Violence and Te...",,Hewitt Project,-9,-9,0,-9,


### Pulling entities and (semi-accurate) metadata with `spacy`

In [9]:
df.summary.values[2]

'1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a firebomb at R.O.T.C. offices located within the Old Red Gym at the University of Wisconsin in Madison, Wisconsin, United States.  There were no casualties but the fire caused around $60,000 in damages to the building.'

In [10]:
doc = nlp(df.summary.values[2])

In [11]:
doc.ents

(1/2/1970,
 Karl Armstrong,
 the New Years Gang,
 R.O.T.C.,
 the Old Red Gym,
 the University of Wisconsin,
 Madison,
 Wisconsin,
 United States,
 around $60,000)

In [12]:
df.gname.values[2]

"New Year's Gang"

In [13]:
doc.ents[2].label_

'DATE'

In [14]:
doc.ents[2].orth_

'the New Years Gang'
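
The extracted entity (`the New Years Gang`) and the recorded group name (`New Year's Gang`) differ only in case, a leading article, and an apostrophe. A minimal sketch of a normalization that makes the two comparable, assuming those are the only differences we care about; `normalize` is a hypothetical helper, not part of the notebook:

```python
def normalize(name):
    """Lowercase, remove apostrophes, and drop a leading 'the '."""
    name = name.lower().replace("'", "")
    if name.startswith("the "):
        name = name[len("the "):]
    return name

entity_text = "the New Years Gang"  # as extracted from the summary
recorded = "New Year's Gang"        # as stored in gname

print(normalize(entity_text))
print(normalize(recorded))
print(normalize(entity_text) == normalize(recorded))
```

This will not help when the summary uses a different name altogether (an acronym, say), which is the harder version of the mismatch problem.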

In [15]:
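def _build_ent_dict(e):
    # collect the lowercased entity text, its type, and its character span
    outdict = {}
    outdict["name"] = e.orth_.lower()
    # drop a leading article; the trailing space avoids clipping names like "theater"
    if outdict["name"].startswith("the "):
        outdict["name"] = outdict["name"][4:]
    outdict["type"] = e.label_
    outdict["start"] = e.start_char
    outdict["end"] = e.end_char
    return outdict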

In [16]:
for e in doc.ents:
    print(_build_ent_dict(e))

{'name': '1/2/1970', 'type': 'CARDINAL', 'start': 0, 'end': 8}
{'name': 'karl armstrong', 'type': 'PERSON', 'start': 10, 'end': 24}
{'name': 'new years gang', 'type': 'DATE', 'start': 38, 'end': 56}
{'name': 'r.o.t.c.', 'type': 'GPE', 'start': 78, 'end': 86}
{'name': 'old red gym', 'type': 'ORG', 'start': 110, 'end': 125}
{'name': 'university of wisconsin', 'type': 'ORG', 'start': 129, 'end': 156}
{'name': 'madison', 'type': 'GPE', 'start': 160, 'end': 167}
{'name': 'wisconsin', 'type': 'GPE', 'start': 169, 'end': 178}
{'name': 'united states', 'type': 'GPE', 'start': 180, 'end': 193}
{'name': 'around $60,000', 'type': 'MONEY', 'start': 241, 'end': 255}


In [17]:
foo = df.iloc[0]

In [18]:
foo.eventid

197001010002

Next we write a function that maps each GTD record to zero or more binary classification tasks. We'll hold on to the ground truth for each one so that we can tell whether anything we build actually works.

In [19]:
sentences = [" "+x.strip()+" " for x in foo.summary.split(".")]
sentences

[' 1/1/1970: Unknown African American assailants fired several bullets at police headquarters in Cairo, Illinois, United States ',
 ' There were no casualties, however, one bullet narrowly missed several police officers ',
 ' This attack took place during heightened racial tensions, including a Black boycott of White-owned businesses, in Cairo Illinois ',
 '  ']
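
Splitting on every period is crude: it leaves a blank trailing piece (as in the output above) and it also cuts abbreviations apart. A small sketch of both effects, using a shortened version of the New Years Gang summary; the variable names are illustrative only:

```python
summary = ("1/2/1970: Karl Armstrong, a member of the New Years Gang, threw a "
           "firebomb at R.O.T.C. offices located within the Old Red Gym.")

# same split-and-pad expression as in the cell above
pieces = [" " + x.strip() + " " for x in summary.split(".")]
for p in pieces:
    print(repr(p))

# dropping whitespace-only pieces removes the trailing blank,
# but "R.O.T.C." is still broken into fragments
non_empty = [p for p in pieces if p.strip()]
print(len(pieces), len(non_empty))
```

spaCy's own sentence segmentation (`doc.sents`) may cope better with abbreviations, but that is not what this notebook uses, so the rest of the preprocessing keeps the period split.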

In [20]:
d1 = _build_ent_dict(nlp(sentences[0]).ents[0])
d1

{'name': '1/1/1970', 'type': 'DATE', 'start': 1, 'end': 9}

In [21]:
def _label_sentence(s, d):
    outstr = s[:d["start"]]
    outstr += " ENTSTART "
    outstr += s[d["start"]:d["end"]]
    outstr += " ENTEND "
    outstr += s[d["end"]:]
    return outstr.replace("  ", " ")

In [22]:
_label_sentence(sentences[0], d1)

' ENTSTART 1/1/1970 ENTEND : Unknown African American assailants fired several bullets at police headquarters in Cairo, Illinois, United States '

In [23]:
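def _prep_row(r):
    # naive sentence split on periods; each piece is padded with spaces
    sentences = [" "+x.strip()+" " for x in r.summary.split(".")]
    # entity types that are unlikely to name a perpetrator group
    skip_types = ["DATE", "MONEY", "ORDINAL",
                  "WORK_OF_ART", "TIME", "QUANTITY",
                  "PERCENT"]
    # recorded group names, lowercased, with apostrophes removed
    # (missing values become the string "nan")
    gnames = " ".join([str(r.gname).lower(),
                       str(r.gname2).lower(),
                       str(r.gname3).lower()]).replace("'","")
    sentlist = []
    for s in sentences:
        doc = nlp(s)
        for e in doc.ents:
            if e.label_ not in skip_types:
                d = _build_ent_dict(e)
                d["labeled_sentence"] = _label_sentence(s,d)
                d["sentence"] = s
                # ground truth: 1 if the entity name is a substring of the group names
                d["label"] = int(d["name"] in gnames)
                d["eventid"] = r.eventid
                sentlist.append(d)
    return sentlist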
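
The ground-truth label in `_prep_row` is a plain substring test against the joined group names. The sketch below reproduces that test in isolation so its edge cases are easy to see; `substring_label` is a hypothetical stand-in, and `float("nan")` is assumed to play the role of a missing `gname2` / `gname3` value.

```python
def substring_label(entity_name, gnames):
    """Return 1 if entity_name occurs anywhere in the joined, lowercased group names."""
    haystack = " ".join(str(g).lower() for g in gnames).replace("'", "")
    return int(entity_name in haystack)

nan = float("nan")  # assumption: stands in for a missing gname2 / gname3
gnames = ["New Year's Gang", nan, nan]

print(substring_label("new years gang", gnames))  # intended match
print(substring_label("madison", gnames))         # unrelated entity
print(substring_label("year", gnames))            # partial word also matches
print(substring_label("nan", gnames))             # matches the string form of a missing value
```

So the labels are a convenient approximation of ground truth rather than an exact one: short or partial entity names can be marked positive by accident.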

In [24]:
%%time
all_records = []
for i,r in df.iterrows():
    all_records += _prep_row(r)

prep_df = pd.DataFrame(all_records)

CPU times: user 5h 10min 28s, sys: 7min 9s, total: 5h 17min 38s
Wall time: 1h 46min 8s


In [26]:
prep_df.head()

Unnamed: 0,end,eventid,label,labeled_sentence,name,sentence,start,type
0,35,197001010002,0,1/1/1970: Unknown ENTSTART African American E...,african american,1/1/1970: Unknown African American assailants...,19,NORP
1,100,197001010002,0,1/1/1970: Unknown African American assailants...,cairo,1/1/1970: Unknown African American assailants...,95,GPE
2,110,197001010002,0,1/1/1970: Unknown African American assailants...,illinois,1/1/1970: Unknown African American assailants...,102,GPE
3,125,197001010002,0,1/1/1970: Unknown African American assailants...,united states,1/1/1970: Unknown African American assailants...,112,GPE
4,39,197001010002,0,"There were no casualties, however, ENTSTART o...",one,"There were no casualties, however, one bullet...",36,CARDINAL


In [27]:
len(all_records)

844782

In [127]:
prep_df = pd.DataFrame(all_records)
len(prep_df)

237494

In [126]:
prep_df.head()

Unnamed: 0,name,type,start,end,labeled_sentence,sentence,label,eventid
0,african american,NORP,19,35,1/1/1970: Unknown ENTSTART African American E...,1/1/1970: Unknown African American assailants...,0,197001010002
1,cairo,GPE,95,100,1/1/1970: Unknown African American assailants...,1/1/1970: Unknown African American assailants...,0,197001010002
2,illinois,GPE,102,110,1/1/1970: Unknown African American assailants...,1/1/1970: Unknown African American assailants...,0,197001010002
3,united states,GPE,112,125,1/1/1970: Unknown African American assailants...,1/1/1970: Unknown African American assailants...,0,197001010002
4,one,CARDINAL,36,39,"There were no casualties, however, ENTSTART o...","There were no casualties, however, one bullet...",0,197001010002


In [128]:
prep_df.label.value_counts()

0    214454
1     23040
Name: label, dtype: int64
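
The labels are heavily imbalanced. A quick piece of arithmetic on the counts above, copied in by hand rather than read from `prep_df`:

```python
# counts copied from the value_counts() output above
counts = {0: 214454, 1: 23040}

total = sum(counts.values())
positive_rate = counts[1] / total
print(f"{total} records, {positive_rate:.1%} positive")
```

Roughly one record in ten is a positive example, which is worth keeping in mind when evaluating anything trained on these labels.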

In [129]:
prep_df.type.value_counts()

GPE            83318
CARDINAL       47526
ORG            45471
PERSON         24880
NORP           17846
TIME            6886
LOC             3682
FAC             3070
QUANTITY        2980
PRODUCT          972
WORK_OF_ART      386
EVENT            238
LAW              142
LANGUAGE          79
PERCENT           18
Name: type, dtype: int64

In [134]:
prep_df.name[prep_df.type == "PRODUCT"].value_counts()

corsica                        214
kasab                           25
conspiracy of cells of fire     13
katargam                        13
windows                         12
                              ... 
chagma                           1
itum-kale                        1
bm1                              1
mayors                          1
town councilor                   1
Name: name, Length: 544, dtype: int64

In [25]:
prep_df.to_csv("groupname_preprocessed_full.csv", index=False)