# Data Preprocessing using RegEx as rules

The aim of this project is to find the names of people killed by the police in a corpus of newspaper articles. The corpus was created by Katherine A. Keith et al. (2017) for a similar task using distant supervision. The dataset contains mentions of people who might have been killed by the police, selected using keywords related to “killing” or “police”. The dataset (both the HTML documents scraped in 2016 and the already sentence-segmented data) is available on the [project’s website](http://slanglab.cs.umass.edu/PoliceKillingsExtraction/) and on [MinIO](https://knodle.dm.univie.ac.at/minio/knodle/datasets/police_killing/).
There is a train and a test dataset; both contain dictionaries with the following keys:

-	docid: unique identifier of every mention of a person possibly killed by the police
-	name: the normalized name of the person
-	downloadtime: time the document was downloaded
-	names_org: the original name of the person mentioned in the document
-	sentnames: other names in the mention (not of the person possibly killed by the police)
-	sent_alter: the mention, with the name of the person possibly killed by the police replaced by “TARGET” and any other names replaced by “PERSON”
-	plabel: the label of the mention; for the training data possibly erroneous labels obtained using weak supervision, for the test data gold labels (in this project, only the labels of the test data will be used)
-	sent_org: the original mention


The rules used in this notebook are regular expressions (RegEx) that are slightly more complicated than the word pairs used for the simple rules. They should cover the different ways a sentence can express that a person ("TARGET") was killed by the police, using different words for killing and police as well as active and passive constructions.


Reference: Keith, Katherine A. et al. (2017): Identifying civilians killed by police with distantly supervised entity-event extraction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. doi: [10.18653/v1/D17-1163](https://aclanthology.org/D17-1163/)

## Imports

In [1]:
import pandas as pd
import json
import os
import numpy as np
import re
import scipy.sparse as sp
from tqdm import tqdm
from pathlib import Path
from joblib import dump
from typing import List

## Getting the data

First of all, the file names for the output at the end of this notebook are defined. After that, the raw data can be downloaded from MinIO.

In [2]:
# define the file names
Z_MATRIX_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_TEST = "test_rule_matches_z.lib"

T_MATRIX_TRAIN = "mapping_rules_labels_t.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

# file names for .csv files
TRAIN_SAMPLES_CSV = "df_train.csv"
DEV_SAMPLES_CSV = "df_dev.csv"
TEST_SAMPLES_CSV = "df_test.csv"

# define the path to the folder where the data will be stored
#data_path = "../../../data/police_killing"
data_path = "C:/Users/Emilie/Uni/2021WS/DS_project/data/police_killing"
os.makedirs(data_path, exist_ok=True)
os.path.join(data_path)

'C:/Users/Emilie/Uni/2021WS/DS_project/data/police_killing'

**Getting the Train data**

We read the downloaded data and convert it to a pandas DataFrame. For now, we take only the samples for the train data, and the samples as well as the labels for the test data. In the end, we will also need the name of the person in case it turns out they were killed by the police. However, in this step their name should be replaced by the TARGET symbol. Therefore, we only take the values for the "sent_alter" key and rename the column to "samples".

In [3]:
# TODO: later, the data should first be downloaded from MinIO

def get_train_data(data_path):
    with open(os.path.join(data_path, "train.json"), 'r') as data:
        train_data = [json.loads(line) for line in data] #a list of dicts
    df_train_sent_alter = pd.DataFrame(train_data, columns = ["sent_alter"]).rename(columns={"sent_alter": "samples"})
    return df_train_sent_alter

df_train = get_train_data(data_path)
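
As a small illustration of this reading step, the following sketch parses two JSON Lines records from an in-memory string instead of `train.json`. The records and their values are made up for this example and contain only two of the keys; the real files contain all keys listed in the introduction.

```python
import io
import json

import pandas as pd

# Two hypothetical records in JSON Lines format (one JSON object per line).
# The values are invented for illustration and are not taken from the corpus.
raw = io.StringIO(
    '{"docid": "doc_a", "sent_alter": "TARGET was shot by police ."}\n'
    '{"docid": "doc_b", "sent_alter": "PERSON met TARGET on Monday ."}\n'
)

records = [json.loads(line) for line in raw]  # a list of dicts

# Passing `columns` keeps only the requested keys; all other keys are dropped.
df_example = pd.DataFrame(records, columns=["sent_alter"]).rename(
    columns={"sent_alter": "samples"}
)
print(df_example)
```

Only the "samples" column remains, which is also the shape of `df_train` in this notebook.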

In [4]:
df_train.head()

| | samples |
|---|---|
| 0 | Two years earlier , Officer TARGET was killed ... |
| 1 | Police Chief PERSON said Randolph was found sh... |
| 2 | In the latest incident , Chief Superintendent ... |
| 3 | Chief TARGET of Penn Township police entered t... |
| 4 | A man was was fatally shot by a police officer... |


**Getting the Development and Test Data**

Since the [SLANG Lab](http://slanglab.cs.umass.edu/PoliceKillingsExtraction/) only provides train and test data, but no development data, the test data must be split so that part of it can be used for development and the rest for testing. The samples for the development data are selected randomly to avoid imbalances of positive and negative samples between the dev and test data.

The parameter *used_as_dev* is the percentage of the gold-labeled data that should be used for development instead of testing. It is set to 30% for now, but can be changed depending on the application of the data.

In [5]:
used_as_dev = 30
print(f"{used_as_dev}% of the test data will be used for development.")

30% of the test data will be used for development.


In [6]:
def get_dev_test_data(data_path):
    with open(os.path.join(data_path, "test.json"), 'r') as data:
        dev_test_data = [json.loads(line) for line in data]
    dev_test_sent_alter = pd.DataFrame(dev_test_data, columns = ["sent_alter", "plabel"]).rename(columns={"sent_alter": "samples", "plabel": "label"})
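    n_dev = int(round((dev_test_sent_alter.shape[0]/100)*used_as_dev))
    # sample the development rows and keep their original index, so that exactly these rows are removed from the test data
    dev_sample = dev_test_sent_alter.sample(n = n_dev)
    df_test = dev_test_sent_alter.drop(dev_sample.index).reset_index(drop = True)
    df_dev = dev_sample.reset_index(drop = True)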
    return df_dev, df_test
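
The split can be checked on toy data. The sketch below uses a hypothetical helper `split_dev_test` with a `random_state` parameter (neither is part of this notebook) so that the same rows are selected on every run. It draws the development rows by their original index and removes exactly those rows from the test part.

```python
import pandas as pd


def split_dev_test(df: pd.DataFrame, used_as_dev: int, random_state: int = 0):
    """Split df into a dev part (used_as_dev percent of the rows) and a test part."""
    n_dev = int(round(df.shape[0] / 100 * used_as_dev))
    # Keep the original index of the sampled rows so they can be dropped below.
    dev = df.sample(n=n_dev, random_state=random_state)
    test = df.drop(dev.index)
    return dev.reset_index(drop=True), test.reset_index(drop=True)


# Toy data: ten made-up samples with unique texts.
toy = pd.DataFrame({"samples": [f"sentence {i}" for i in range(10)],
                    "label": [0, 1] * 5})
toy_dev, toy_test = split_dev_test(toy, used_as_dev=30)

print(len(toy_dev), len(toy_test))
# The intersection is empty: no sample appears in both parts.
print(set(toy_dev["samples"]) & set(toy_test["samples"]))
```

Checking that the intersection is empty is a cheap way to make sure that no gold-labeled sample is used for both development and testing.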

In [7]:
df_dev, df_test = get_dev_test_data(data_path)

print(f"Number of samples:\n\nTrain data: {df_train.shape[0]}\nDevelopment data: {df_dev.shape[0]}\nTest data: {df_test.shape[0]}")

Number of samples:

Train data: 132833
Development data: 20678
Test data: 48247


In [8]:
df_dev.head()

| | samples | label |
|---|---|---|
| 0 | The Show Low officer , TARGET , died in a near... | 0 |
| 1 | Just after 7:30 a.m. Sunday , officers with th... | 0 |
| 2 | Police said Hodzic had links to the radical Is... | 0 |
| 3 | PERSON TARGET ReporterSun Sentinel Convicted f... | 0 |
| 4 | Authorities have not identified the gunman who... | 1 |


In [9]:
df_test.head()

| | samples | label |
|---|---|---|
| 0 | [ ] TARGET / Chicago Tribune Lake County Major... | 0 |
| 1 | Round Lake police shooting Round Lake police s... | 0 |
| 2 | PERSON shooting PERSON shooting TARGET / Chica... | 0 |
| 3 | Scene of Round Lake police shooting Scene of R... | 0 |
| 4 | involved shooting TARGET / Chicago Tribune The... | 0 |


### Classes

Since the goal is to find out whether or not a sentence describes the killing of a person by the police, this is a binary classification task with only two classes. The number of classes is stored in *num_classes*.

In [10]:
num_classes = 2

## Getting the rules

These word lists are mainly based on the lists of Keith et al. (2017, p. 11). However, here they are split into several different lists to create more precise RegEx. A rule must contain a police word, a killing word and, if the killing word is a shooting word, also a fatality word (being shot does not necessarily mean that someone dies). The different constructions make sure that the words do not just appear in a random order in a sentence, but in an order in which the sentence can actually mean that the TARGET was killed by the police.
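
To make the constructions concrete, the sketch below applies one active and one passive pattern of the kind described above to three made-up, lowercased sentences. It also shows a limitation of plain substring patterns: `cop` is also found inside `copy`. The variant with `\b` word boundaries is only an assumption about how this could be tightened; it is not what the rules in this notebook use.

```python
import re

active_rule = "police.*killed.*target"       # police word ... killing word ... target
passive_rule = "target.*killed.*by.*police"  # target ... killing word ... by ... police word

# Made-up example sentences, already lowercased.
sentences = [
    "police killed target on friday .",
    "target was killed by police on friday .",
    "target called the police after a neighbour was killed .",
]

for sentence in sentences:
    print(bool(re.search(active_rule, sentence)),
          bool(re.search(passive_rule, sentence)),
          sentence)

# Substring matching: "cop" is found inside "copy".
text = "a copy editor said a storm killed target ."
print(bool(re.search("cop.*killed.*target", text)))
# Hypothetical stricter variant with word boundaries (not used in this notebook).
print(bool(re.search(r"\bcop\b.*\bkilled\b.*\btarget\b", text)))
```

The third sentence contains a police word, a killing word and the target, but in an order that neither pattern accepts, which is the effect the different constructions are meant to have.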

In [11]:
police_words = ['police', 'officer', 'officers', 'cop', 'cops', 'detective', 'sheriff', 'policeman', 'policemen',
                'constable', 'patrolman', 'sergeant', 'detectives', 'patrolmen', 'policewoman', 'constables',
                'trooper', 'troopers', 'sergeants', 'lieutenant', 'deputies', 'deputy']

killing_words_active = ['shot', 'shoots', 'shoot', 'shooting', 'shots', 'killed', 'kill', 'kills', 'killing', 'murder', 'murders']

killing_words_passive = ['hit', 'shot', 'killed', 'murdered']

shooting_words = ['shot', 'shoots', 'shoot', 'shooting', 'shots']

fatality_words = ['fatal', 'fatally', 'died', 'killed', 'killing', 'dead', 'deadly', 'homicide', 'homicides']

We start by creating a dictionary with all the rules and their corresponding rule IDs.

In [12]:
def creating_rules() -> dict:
    
    rule2rule_id = {}
    rule_id = 0
    
    for police_word in police_words: 
        
        for killing_word_active in killing_words_active:
            if killing_word_active not in shooting_words:
                a1 = f"{police_word}.*{killing_word_active}.*target"
                rule2rule_id[a1] = rule_id
                rule_id += 1
            else:
                for fatality_word in fatality_words:
                    a2 = f"{police_word}.*{killing_word_active}.*target.*{fatality_word}"
                    rule2rule_id[a2] = rule_id
                    rule_id += 1
                    a3 = f"{police_word}.*{fatality_word}.*{killing_word_active}.*target"
                    rule2rule_id[a3] = rule_id
                    rule_id += 1
                    a4 = f"{police_word}.*{killing_word_active}.*{fatality_word}.*target"
                    rule2rule_id[a4] = rule_id
                    rule_id += 1
                    

        for killing_word_passive in killing_words_passive:
            if killing_word_passive not in shooting_words:
                p1 = f"target.*{killing_word_passive}.*by.*{police_word}"
                rule2rule_id[p1] = rule_id
                rule_id += 1
            else: 
                for fatality_word in fatality_words:
                    p2 = f"target.*{killing_word_passive}.*{fatality_word}.*by.*{police_word}"
                    rule2rule_id[p2] = rule_id
                    rule_id += 1
                    p3 = f"target.*{fatality_word}.*{killing_word_passive}.*by.*{police_word}"
                    rule2rule_id[p3] = rule_id
                    rule_id += 1
                    p4 = f"target.*{killing_word_passive}.*by.*{police_word}.*{fatality_word}"
                    rule2rule_id[p4] = rule_id
                    rule_id += 1
                        
    return rule2rule_id

In [13]:
rule2rule_id = creating_rules()
print(f"There are {len(rule2rule_id)} rules.")

There are 3762 rules.
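
The number of rules can be checked by hand. For each police word, every killing word that is not a shooting word yields one pattern, and every shooting word yields three patterns per fatality word. The sketch below recomputes the count from the sizes of the word lists above; the variable names are chosen for this example only.

```python
n_police = 22    # len(police_words)
n_fatality = 9   # len(fatality_words)

# Active: 11 killing words, 5 of which are shooting words.
active = (11 - 5) * 1 + 5 * n_fatality * 3   # 6 + 135 = 141
# Passive: 4 killing words, of which only 'shot' is a shooting word.
passive = (4 - 1) * 1 + 1 * n_fatality * 3   # 3 + 27 = 30

rules_per_police_word = active + passive     # 171
total = n_police * rules_per_police_word
print(total)  # 3762

# Rule IDs follow the construction order. 'officer' is the second police word,
# so its rules start at 171; its passive rules start after the 141 active ones,
# with 'hit' (1 rule) and 'shot' (27 rules) coming before 'killed'.
rule_id_officer_killed_passive = 171 + 141 + 1 + 27
print(rule_id_officer_killed_passive)  # 340
```

The second value is the ID of `target.*killed.*by.*officer`, which agrees with the rule ID shown for this pattern in the test data output further below.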


Secondly, we create a dictionary assigning all rules to their label. There are only two classes (someone was killed by the police or was not killed by the police). Since there are no rules indicating that someone was **not** killed by the police, all rules indicate the positive class 1. Therefore, all values of the rule2label dictionary, containing the rule IDs as keys, can be set to 1.

In [14]:
rule2label = {rule_id: 1 for rule_id in rule2rule_id.values()}

Thirdly, a dictionary mapping the labels to their ID as well as a dictionary mapping the ID to the corresponding label are required. As there are only two classes, this can be done manually. 

In [15]:
label2label_id = {"negative": 0, "positive": 1}
label_id2label = {0: "negative", 1: "positive"}

## Building the T matrix


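The rows of the T matrix are the rules and the columns are the classes. The T matrix is one-hot encoded: a cell is 1 if the rule in that row indicates the class in that column. The function that builds it is taken from the TAC tutorial and could be imported instead of being defined here.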

In [16]:
# mapping to the T matrix (function taken from the TAC tutorial; it should eventually be imported from a separate script)

def get_mapping_rules_labels_t(rule2label, num_classes):
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    mapping_rules_labels_t = np.zeros([len(rule2label), num_classes])
    for rule, labels in rule2label.items():
        mapping_rules_labels_t[rule, labels] = 1
    return mapping_rules_labels_t

mapping_rules_labels_t = get_mapping_rules_labels_t(rule2label, num_classes)
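
A toy example may help to picture the result. The sketch below builds a T matrix for three rules and two classes; the mapping `toy_rule2label` is made up, and its second rule is a hypothetical negative rule that has no counterpart in this notebook.

```python
import numpy as np

# Toy setup: three rules, two classes; rule 1 is a hypothetical negative rule.
toy_rule2label = {0: 1, 1: 0, 2: 1}
toy_num_classes = 2

toy_t = np.zeros([len(toy_rule2label), toy_num_classes])
for rule_id, label_id in toy_rule2label.items():
    toy_t[rule_id, label_id] = 1  # one 1 per row, in the column of the rule's class

print(toy_t)
```

Because all rules in this notebook indicate the positive class, every row of `mapping_rules_labels_t` has its 1 in column 1 and column 0 contains only zeros.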

## Building the Z matrix

### Getting the train data
*(Fastest solution I could find, but it still takes quite long.)*

We start by creating a list of dictionaries (one for each sample; later they will be the rows of the DataFrame). Each dictionary contains the sample itself as well as a list of the matching rules and a list of the corresponding rule IDs. In the first step, these lists are still empty. After that, we populate them: we take each rule and apply it to each sample, and if it matches, the rule and the rule ID are added to the dictionary of that sample. In the end, the list of dictionaries is converted into a pandas DataFrame.

In [17]:
def get_data_dicts(data: pd.DataFrame, rule2rule_id: dict) -> list:

    data_dicts_empty = []

    for sample in data["samples"].drop_duplicates():
        data_dict = dict({})
        data_dict["samples"] = sample
        data_dict["rules"] = []
        data_dict["enc_rules"] = []
        
        data_dicts_empty.append(data_dict)
        
    return data_dicts_empty


def get_data_for_dicts(data_dicts: list, rule2rule_id: dict) -> list:

    # lowercase every sample only once instead of once per rule
    lowered_samples = [data_dict["samples"].lower() for data_dict in data_dicts]

    for rule, rule_id in tqdm(rule2rule_id.items()):
        pattern = re.compile(rule)
        for data_dict, sample in zip(data_dicts, lowered_samples):
            if pattern.search(sample):
                data_dict["rules"].append(rule)
                data_dict["enc_rules"].append(rule_id)
                
    return data_dicts


def get_df(data: pd.DataFrame, rule2rule_id: dict) -> pd.DataFrame:
    
    data_dicts_empty = get_data_dicts(data, rule2rule_id)
    data_dicts = get_data_for_dicts(data_dicts_empty, rule2rule_id)
    df = pd.DataFrame.from_dict(data_dicts)
    df = df.reset_index()
       
    return df
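
The structure of the resulting DataFrame can be seen on toy data. The sketch below is an alternative formulation that uses `Series.str.contains` to apply each rule to all samples at once. The helper `match_rules_with_pandas`, the toy sentences and the two toy rules are assumptions made for this example; it has not been benchmarked against the functions above.

```python
import pandas as pd


def match_rules_with_pandas(samples: pd.Series, rule2rule_id: dict) -> pd.DataFrame:
    """Collect, for every unique sample, the matching rules and their IDs."""
    unique_samples = samples.drop_duplicates().reset_index(drop=True)
    lowered = unique_samples.str.lower()  # lowercase once, not once per rule

    rules = [[] for _ in range(len(unique_samples))]
    enc_rules = [[] for _ in range(len(unique_samples))]

    for rule, rule_id in rule2rule_id.items():
        mask = lowered.str.contains(rule, regex=True)
        for position in mask[mask].index:  # positions of the matching samples
            rules[position].append(rule)
            enc_rules[position].append(rule_id)

    return pd.DataFrame({"samples": unique_samples, "rules": rules, "enc_rules": enc_rules})


# Made-up sentences and two toy rules.
toy_samples = pd.Series([
    "Officer shot and killed TARGET .",
    "TARGET was killed by police .",
    "TARGET went home .",
])
toy_rules = {"officer.*killed.*target": 0, "target.*killed.*by.*police": 1}

toy_result = match_rules_with_pandas(toy_samples, toy_rules)
print(toy_result)
```

Samples without any matching rule keep empty lists, just as in the DataFrames built in this notebook.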

In [18]:
train_data = get_df(df_train, rule2rule_id)

100%|██████████████████████████████████████████████████████████████████████████████| 3762/3762 [27:29<00:00,  2.28it/s]


### Getting the Development and Test data

Just as for the train data, we need a DataFrame with each sample, its matching rules and the rule IDs. Moreover, we need to add the labels and the label IDs that we obtained earlier when reading in the test data. We do this by merging the new DataFrame, which contains only the samples, rules and rule encodings, with the development and test DataFrames that contain the labels.

In [21]:
def get_dev_test_df(rule2rule_id: dict, data: pd.DataFrame, label_id2label: dict) -> pd.DataFrame:

    dev_test_data_without_labels = get_df(data, rule2rule_id)
    dev_test_data = dev_test_data_without_labels.merge(data, how='inner').rename(columns={"label": "enc_labels"})
    dev_test_data["labels"] = dev_test_data['enc_labels'].map(label_id2label)
    
    return dev_test_data

In [22]:
dev_data = get_dev_test_df(rule2rule_id, df_dev, label_id2label)
test_data = get_dev_test_df(rule2rule_id, df_test, label_id2label)
test_data.head()

100%|██████████████████████████████████████████████████████████████████████████████| 3762/3762 [01:49<00:00, 34.44it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3762/3762 [03:44<00:00, 16.79it/s]


| | index | samples | rules | enc_rules | enc_labels | labels |
|---|---|---|---|---|---|---|
| 0 | 0 | [ ] TARGET / Chicago Tribune Lake County Major... | [] | [] | 0 | negative |
| 1 | 1 | Round Lake police shooting Round Lake police s... | [] | [] | 0 | negative |
| 2 | 2 | PERSON shooting PERSON shooting TARGET / Chica... | [] | [] | 0 | negative |
| 3 | 3 | Scene of Round Lake police shooting Scene of R... | [] | [] | 0 | negative |
| 4 | 4 | involved shooting TARGET / Chicago Tribune The... | `[target.*killed.*by.*officer]` | [340] | 0 | negative |


### Converting to sparse matrix

The train, development and test data that we just stored as pandas DataFrames should now be converted into SciPy sparse matrices. The rows of each sparse matrix are the samples and the columns are the rules. It is also one-hot encoded (a cell is 1 if a rule matches a sample). We initialize it as an array of the correct size (samples x rules) and set a cell to 1 if the rule matches the sample. In the end, the array is converted to a sparse matrix.

In [23]:
def get_rule_matches_z_matrix(df: pd.DataFrame) -> sp.csr_matrix:

    z_array = np.zeros((len(df), len(rule2rule_id)))

    # go through the rows by position, so that row i of the Z matrix holds the rules matched by sample i
    for row, enc_rules in enumerate(tqdm(df["enc_rules"])):
        for enc_rule in enc_rules:
            z_array[row][enc_rule] = 1

    rule_matches_z_matrix_sparse = sp.csr_matrix(z_array)

    return rule_matches_z_matrix_sparse
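
With 132680 train samples and 3762 rules, the intermediate dense `float64` array takes roughly 4 GB of memory. As a hedged alternative, the sketch below builds the sparse matrix directly from (row, column) pairs, so that no dense array is needed. The toy input and the variable names are made up for this example.

```python
import numpy as np
import scipy.sparse as sp

# Toy input: the rule IDs that matched each of four samples (made up).
toy_enc_rules = [[0, 2], [], [1], [2]]
n_rules = 3

rows, cols = [], []
for row, enc_rules in enumerate(toy_enc_rules):
    for enc_rule in enc_rules:
        rows.append(row)       # sample position
        cols.append(enc_rule)  # rule ID

values = np.ones(len(rows))
toy_z = sp.csr_matrix((values, (rows, cols)), shape=(len(toy_enc_rules), n_rules))
print(toy_z.toarray())
```

Only the matching cells are stored, so the memory needed grows with the number of matches rather than with samples times rules.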

In [24]:
train_rule_matches_z = get_rule_matches_z_matrix(train_data)
dev_rule_matches_z = get_rule_matches_z_matrix(dev_data)
test_rule_matches_z = get_rule_matches_z_matrix(test_data)

100%|████████████████████████████████████████████████████████████████████████| 132680/132680 [00:13<00:00, 9758.63it/s]
100%|██████████████████████████████████████████████████████████████████████████| 20678/20678 [00:02<00:00, 8138.60it/s]
100%|██████████████████████████████████████████████████████████████████████████| 48247/48247 [00:06<00:00, 7859.80it/s]


## Saving the files

In [25]:
Path(os.path.join(data_path, "processed")).mkdir(parents=True, exist_ok=True)

dump(sp.csr_matrix(mapping_rules_labels_t), os.path.join(data_path, "processed", T_MATRIX_TRAIN))

dump(train_data["samples"], os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT))
train_data["samples"].to_csv(os.path.join(data_path, "processed", TRAIN_SAMPLES_CSV), header=True)
dump(train_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_TRAIN))

dump(dev_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT))
dev_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", DEV_SAMPLES_CSV), header=True)
dump(dev_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_DEV))

dump(test_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT))
test_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", TEST_SAMPLES_CSV), header=True)
dump(test_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_TEST))

['C:/Users/Emilie/Uni/2021WS/DS_project/data/police_killing\\processed\\test_rule_matches_z.lib']
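
The saved files can be read back with `joblib.load`. The following sketch shows the round trip on a toy matrix in a temporary directory; the file name `toy_rule_matches_z.lib` is made up for this example.

```python
import os
import tempfile

import scipy.sparse as sp
from joblib import dump, load

toy_z = sp.csr_matrix([[1, 0], [0, 1]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_rule_matches_z.lib")
    dump(toy_z, path)       # same call as used for the real matrices above
    restored = load(path)   # returns the object that was dumped

# The restored matrix has the same shape and the same entries.
print(restored.shape, (restored != toy_z).nnz)
```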