# Data Preprocessing

The aim of this project is to find names of people killed by the police in a corpus of news paper articles. The corpus was created by Katherine A. Keith et al. (2017) for a similar task using distant supervision. This dataset contains mentions of people (based on keywords related to “killing” or “police”) who might have been killed by the police. The dataset (the HTML documents scraped in 2016 themselves as well as the already sentence-segmented data) are available on the [project’s website](http://slanglab.cs.umass.edu/PoliceKillingsExtraction/) and on [MinIO]( https://knodle.dm.univie.ac.at/minio/knodle/datasets/police_killing/). 
There is a train and a test dataset, both of them containing dictionaries with the following keys:

-	docid: unique identifiers of every mention of a person possible killed by the police
-	name: the normalized name of the person
-	downloadtime: time the document was downloaded
-	names_org: the original name of the person mentioned in the document
-	sentnames: other names in the mention (not of the person possibly killed by the police)
-	sent_alter: the mention, name of the person possible killed by the policed replaced by “TARGET”, any other names replaced by “POLICE”
-	plabel: for the training data possibly erroneous labels obtained using weak supervision and gold labels for the test data – in this project, only the labels of the test data will be used
-	sent_org: the original mention


The rules that are used in this notebook are simple wordpairs (one of the killing and one of the police related keywords each). A rule matches a sample if both of the words in the wordpair are part of the sample.

Reference: Keith, Kathrine A. et al. (2017): Identifying civilians killed by police with distantly supervised entity-event extraction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. doi: [10.18653/v1/D17-1163](https://aclanthology.org/D17-1163/)

## Imports

In [2]:
import pandas as pd
import json
import os
import numpy as np
import re
import scipy.sparse as sp
from tqdm import tqdm
from pathlib import Path
from joblib import dump
from typing import List

## Getting the data

First of all, the file names for the output at the end of this notebook are defined. After that, the raw data can be downloaded from MinIO.

In [3]:
# define the files names
Z_MATRIX_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_TEST = "test_rule_matches_z.lib"

T_MATRIX_TRAIN = "mapping_rules_labels_t.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

# file names for .csv files
TRAIN_SAMPLES_CSV = "df_train.csv"
DEV_SAMPLES_CSV = "df_dev.csv"
TEST_SAMPLES_CSV = "df_test.csv"

# define the path to the folder where the data will be stored
#data_path = "../../../data/police_killing"
data_path = "C:/Users/Emilie/Uni/2021WS/DS_project/data/police_killing"
os.makedirs(data_path, exist_ok=True)
os.path.join(data_path)

'C:/Users/Emilie/Uni/2021WS/DS_project/data/police_killing'

**Getting the keywords**

Downloading the keywords from MiniO that will later be used to create the rules.

In [4]:
keywords = pd.read_csv(os.path.join(data_path, "keywords.csv"))

**Getting the Train data**

We read the downloaded data and convert it to a Pandas Dataframe. For now, we take only the samples for the train data and the samples as well as the labels for the test data. In the end, we will also need the name of the person in case it turns out they were killed by the police. However, in this step their name should be replaced by the TARGET symbol. Therefore, we only take the values for the "sent_alter" key and rename them to "samples".

In [5]:
# later first downloaded from Minio

def get_train_data(data_path):
    with open(os.path.join(data_path, "train.json"), 'r') as data:
        train_data = [json.loads(line) for line in data] #a list of dicts
    df_train_sent_alter = pd.DataFrame(train_data, columns = ["sent_alter"]).rename(columns={"sent_alter": "samples"})
    return df_train_sent_alter

df_train = get_train_data(data_path)

In [6]:
df_train.head()

Unnamed: 0,samples
0,"Two years earlier , Officer TARGET was killed ..."
1,Police Chief PERSON said Randolph was found sh...
2,"In the latest incident , Chief Superintendent ..."
3,Chief TARGET of Penn Township police entered t...
4,A man was was fatally shot by a police officer...


**Getting the Development and Test Data**

Since the [SLANG Lab](http://slanglab.cs.umass.edu/PoliceKillingsExtraction/) only provides train and test data, but no development data, the test data must be split in order to be able to use some of it for develoment and some of it for testing. The samples for the develoment data will be selected randomly to avoid imbalances of positive and negative sample in dev and test data.

The parameter *used_as_dev* is the amount of the gold labeled data that should be used for develoment instead of testing. It is set to 30% for now, but can be changed depending on the application of the data.

In [7]:
used_as_dev = 30
print(f"{used_as_dev}% of the test data will be used for develoment.")

30% of the test data will be used for develoment.


In [8]:
def get_dev_test_data(data_path):
    with open(os.path.join(data_path, "test.json"), 'r') as data:
        dev_test_data = [json.loads(line) for line in data]
    dev_test_sent_alter = pd.DataFrame(dev_test_data, columns = ["sent_alter", "plabel"]).rename(columns={"sent_alter": "samples", "plabel": "label"})
    df_dev = dev_test_sent_alter.sample(n = int(round((dev_test_sent_alter.shape[0]/100)*used_as_dev))).reset_index(drop = True)
    df_test = dev_test_sent_alter.drop(df_dev.index).reset_index(drop = True)
    return df_dev, df_test

In [9]:
df_dev, df_test = get_dev_test_data(data_path)

print(f"Number of samples:\n\nTrain data: {df_train.shape[0]}\nDevelopment data: {df_dev.shape[0]}\nTest data: {df_test.shape[0]}")

Number of samples:

Train data: 132833
Development data: 20678
Test data: 48247


In [10]:
df_dev.head()

Unnamed: 0,samples,label
0,Jakarta police spokesman TARGET said three off...,0
1,Mecklenburg police release footage of TARGET s...,1
2,Commissioner TARGET split with his colleagues ...,0
3,"Woman killed , TARGET officer wounded in gas s...",0
4,"JULY 30 , 2006 : A Mesa Police Department Crim...",0


In [11]:
df_test.head()

Unnamed: 0,samples,label
0,[ ] TARGET / Chicago Tribune Lake County Major...,0
1,Round Lake police shooting Round Lake police s...,0
2,PERSON shooting PERSON shooting TARGET / Chica...,0
3,Scene of Round Lake police shooting Scene of R...,0
4,involved shooting TARGET / Chicago Tribune The...,0


**The classes**

Since the goal is to find out whether a sentence describes the killing of a person by the police or does not, it is a binary classification task and there are only two classes. The number of classes is defined as the num_classes.

In [None]:
num_classes = 2

## Getting the rules

In the paper of Keith et al. (2017), two lists of police- and killing-related words are used to extract the relevant mentions: 

"*These lists were semi-automatically constructed by looking up the nearest neighbors of “police” and “kill” (by cosine distance) from Google’s public release of word2vec vectors pretrained on a very large (proprietary) Google News corpus,20 and then manually excluding a small number of misspelled words or redundant capitalizations (e.g. “Police” and “police”).*" (Keith et al. p. 11)

The keywords are saved in the CSV that we have already downloaded and read in as a Pandas Dataframe. Now we will use it to create the rules. Each rule is a pair of a police and a killing word, both of these words must appear in sample.

In the first step, all possible wordpairs are created and added to a dictionary as keys. The values are their unique rule IDs.

In [12]:
def get_rule2id(keywords: pd.DataFrame) -> dict:
    
    rule2rule_id = dict({})
    rule_id = 0
    for police_word in keywords["police_words"]:
        for kill_word in keywords["kill_words"].dropna():
            rule2rule_id[f'{police_word} {kill_word}'] = rule_id
            rule_id += 1
    
    return rule2rule_id

rule2rule_id = get_rule2id(keywords)

Secondly, we create a dictionary assigning all rules to their label. There are only two classes (someone was killed by the police or was not killed by the police). Since there are no rules indicating that someone was **not** killed by the police, all rules indicate the positive class 1. Therefore, all values of the rule2label dictionary, containing the rule IDs as keys, can be set to 1.

In [13]:
rule2label = {rule_id: 1 for rule_id in rule2rule_id.values()}

Thirdly, a dictionary mapping the labels to their ID as well as a dictionary mapping the ID to the corresponding label are required. As there are only two classes, this can be done manually. 

*@Anastasiia I will need the "reversed" label_id2label dict later for my code, but I'm not sure if I'm allowed to to it like this?
And is it okay to create it manually?*

In [14]:
label2label_id ={"negative":0, "positive":1}
label_id2label = {0: "negative", 1: "positive"}

## building the T matrix

The rows of the T matrix are the rules and the columns the classes. The T matrix is one-hot encoded. (1 for a rule and its corresponding class.) It can be imported.

In [15]:
"""
mapping to t matrix 
(from TAC tutorial)
"""

def get_mapping_rules_labels_t(rule2label, num_classes):
    """ Function calculates t matrix (rules x labels) using the known correspondence of relations to decision rules """
    mapping_rules_labels_t = np.zeros([len(rule2label), num_classes])
    for rule, labels in rule2label.items():
        mapping_rules_labels_t[rule, labels] = 1
    return mapping_rules_labels_t

mapping_rules_labels_t = get_mapping_rules_labels_t(rule2label, num_classes)

## buidling the Z matrix (instances x rules)

### Getting the Train Data

For now, the wordpairs contained in the rule2rule_id dictionary are simple strings. To use them as actual rules, they need to be converted to RegEx firstly. We create a dictionary containing the strings of wordpairs as keys and a list of the corresponding RegEx as values. It is important to convert them to RegEx in order to be able to look for the exact words in the wordpair. (The words in the rules must not end in another word character (a-z). If the killing word in a rule is, for instance, "kill", only sentences containing the word "kill" should be matched and no sentences containing words like "kills" or "killed" because there are separate rules for these words.)

Secondly, we apply the RegEx to the samples. In the beginning, everything is stored in a list of dictionaries (one for each individual sample in the train data). The dictionaries contain the sample, a list of that rules that match and a list of the corresponding rule IDs. After that, we convert the list of dictionaries to a Pandas Dataframe.

In [16]:
def convert_rules2regex(rule2rule_id: dict) -> dict:
    
    searches = dict({})
    
    for rule in rule2rule_id.keys():
            wordpair = rule.split()
            search4police_word = f'{wordpair[0]}\W'
            search4kill_word = f'{wordpair[1]}\W'
            searches[rule] = [search4police_word, search4kill_word]
            
    return searches


def get_data_for_df(data: pd.DataFrame, searches: dict) -> list:

    data_for_df = []

    for sample in tqdm(data["samples"].drop_duplicates()):
        data_dict = dict({})
        data_dict["samples"] = sample
        data_dict["rules"] = []
        data_dict["enc_rules"] = []
   
        for rule, search in searches.items():
            if re.search(search[0], sample.lower()) and re.search(search[1], sample.lower()):
                data_dict["rules"].append(rule)
                data_dict["enc_rules"].append(rule2rule_id[rule])

        if data_dict["enc_rules"] != []:
            data_for_df.append(data_dict)

    return data_for_df


def get_df(rule2rule_id: dict, data: pd.DataFrame) -> pd.DataFrame:
    
    searches = convert_rules2regex(rule2rule_id)
    data_for_df = get_data_for_df(data, searches)
    df = pd.DataFrame.from_dict(data_for_df)
    df = df.reset_index()
       
    return(df)

In [17]:
train_data = get_df(rule2rule_id, df_train)
train_data.head()

100%|█████████████████████████████████████████████████████████████████████████| 132680/132680 [03:06<00:00, 710.69it/s]


Unnamed: 0,index,samples,rules,enc_rules
0,0,"Two years earlier , Officer TARGET was killed ...",[officer killed],[24]
1,1,Police Chief PERSON said Randolph was found sh...,[police shot],[5]
2,2,"In the latest incident , Chief Superintendent ...",[police shots],[6]
3,3,Chief TARGET of Penn Township police entered t...,[police shot],[5]
4,4,A man was was fatally shot by a police officer...,"[police shot, officer shot]","[5, 25]"


### Getting the Development and Test data

Just as for the Train data, we need a dataframe with a sample, its corresponding rules and the rule IDs. Moreover, we need to add the labels and the label IDs that we obtained earlier when reading in the test data. We do this by merging the the new Dataframe with sample, rule and rule encoding only with the development and test dataframe that contain the labels.

In [18]:
def get_dev_test_df(rule2rule_id: dict, data: pd.DataFrame, label_id2label: dict) -> pd.DataFrame:

    test_data_without_labels = get_df(rule2rule_id, data)
    test_data = test_data_without_labels.merge(data, how='inner').rename(columns={"label": "enc_labels"})
    test_data["labels"] = test_data['enc_labels'].map(label_id2label)
    
    return test_data

In [19]:
dev_data = get_dev_test_df(rule2rule_id, df_dev, label_id2label)

100%|███████████████████████████████████████████████████████████████████████████| 20673/20673 [00:27<00:00, 740.24it/s]


In [20]:
test_data = get_dev_test_df(rule2rule_id, df_test, label_id2label)
test_data.head()

100%|███████████████████████████████████████████████████████████████████████████| 48193/48193 [01:02<00:00, 765.84it/s]


Unnamed: 0,index,samples,rules,enc_rules,enc_labels,labels
0,0,[ ] TARGET / Chicago Tribune Lake County Major...,[police shooting],[9],0,negative
1,1,Round Lake police shooting Round Lake police s...,[police shooting],[9],0,negative
2,2,PERSON shooting PERSON shooting TARGET / Chica...,[police shooting],[9],0,negative
3,3,Scene of Round Lake police shooting Scene of R...,[police shooting],[9],0,negative
4,4,involved shooting TARGET / Chicago Tribune The...,"[officer killed, officer shooting]","[24, 29]",0,negative


### Converting to sparse matrix

The Train, Test and Development data that we just stored as Pandas Dataframes should now be converted into a Scipy sparse matrix. The rows of the sparse matrix are the samples and the columns the rules. It is also one-hot-encoded (a cell is 1 if a rule matches a sample.) We inizialize it as an array in the correct size (samples x rules) and set it to 1 if a rule matches the sample. In the end, the array is converted to a sparse matrix.

In [21]:
def get_rule_matches_z_matrix(df: pd.DataFrame) -> sp.csr_matrix:

    z_array = np.zeros((len(df["index"].values), len(rule2rule_id)))

    for index in tqdm(df["index"]):
        enc_rules = df.iloc[index-1]['enc_rules']
        for enc_rule in enc_rules:
            z_array[index][enc_rule] = 1

    rule_matches_z_matrix_sparse = sp.csr_matrix(z_array)

    return rule_matches_z_matrix_sparse

In [22]:
train_rule_matches_z = get_rule_matches_z_matrix(train_data)

100%|████████████████████████████████████████████████████████████████████████| 132632/132632 [00:41<00:00, 3179.59it/s]


In [23]:
dev_rule_matches_z = get_rule_matches_z_matrix(dev_data)

100%|██████████████████████████████████████████████████████████████████████████| 20675/20675 [00:07<00:00, 2858.21it/s]


In [24]:
test_rule_matches_z = get_rule_matches_z_matrix(test_data)

100%|██████████████████████████████████████████████████████████████████████████| 48238/48238 [00:18<00:00, 2637.03it/s]


## Saving the files

In [25]:
Path(os.path.join(data_path, "processed")).mkdir(parents=True, exist_ok=True)

dump(sp.csr_matrix(mapping_rules_labels_t), os.path.join(data_path, "processed", T_MATRIX_TRAIN))

dump(train_data["samples"], os.path.join(data_path, "processed", TRAIN_SAMPLES_OUTPUT))
train_data["samples"].to_csv(os.path.join(data_path, "processed", TRAIN_SAMPLES_CSV), header=True)
dump(train_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_TRAIN))

dump(dev_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", DEV_SAMPLES_OUTPUT))
dev_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", DEV_SAMPLES_CSV), header=True)
dump(dev_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_DEV))

dump(test_data[["samples", "labels", "enc_labels"]], os.path.join(data_path, "processed", TEST_SAMPLES_OUTPUT))
test_data[["samples", "labels", "enc_labels"]].to_csv(os.path.join(data_path, "processed", TEST_SAMPLES_CSV), header=True)
dump(test_rule_matches_z, os.path.join(data_path, "processed", Z_MATRIX_TEST))

['C:/Users/Emilie/Uni/2021WS/DS_project/data/police_killing\\processed\\test_rule_matches_z.lib']