# Police Killing Dataset: Data Preprocessing with Knowledge Base

This tutorial shows how to find names of people killed by the police in a corpus of newspaper articles. The corpus was created by Katherine A. Keith et al. (2017) for a similar task using distant supervision. This dataset contains mentions of people (based on keywords related to “killing” or “police”) who might have been killed by the police. The dataset (the HTML documents scraped in 2016 themselves as well as the already sentence-segmented data) are available on the [project’s website](http://slanglab.cs.umass.edu/PoliceKillingsExtraction/) and on [MinIO]( https://knodle.dm.univie.ac.at/minio/knodle/datasets/police_killing/). 

### Data Description

There is a train and a test dataset, both of them containing dictionaries with the following keys:

-	docid: unique identifiers of every mention of a person possible killed by the police
-	name: the normalized name of the person
-	downloadtime: time the document was downloaded
-	names_org: the original name of the person mentioned in the document
-	sentnames: other names in the mention (not of the person possibly killed by the police)
-	sent_alter: the mention, name of the person possible killed by the policed replaced by “TARGET”, any other names replaced by “POLICE”
-	plabel: for the training data possibly erroneous labels obtained using weak supervision and gold labels for the test data – in this project, only the labels of the test data will be used
-	sent_org: the original mention

Using weak supervision, a sample should be labelled positive (1) or negative (0) whether it describes the killing of a civilian by the police. Later, we can create a database of civilians' names in positive labelled samples.

Compared to the [Data Preprocessing with RegEx Tutorial](https://github.com/knodle/knodle/blob/feature/%23299_police_killing_dataset/examples/data_preprocessing/police_killing/data_preprocessing_with_regex.ipynb), where RegEx are used in order to cover various different ways a sentence might indiciate that someone was killed by the police, in this tutorial the sentences will be labeled using information from a manually created Knowledge Base. Just like Keith et al., we use the [Fatal Encounters Database](https://fatalencounters.org/) (FE-Database). 

**Reference:**

Keith, Kathrine A. et al. (2017): Identifying civilians killed by police with distantly supervised entity-event extraction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. doi: [10.18653/v1/D17-1163](https://aclanthology.org/D17-1163/)

## Imports

In [1]:
import json
import os
import re
import sys
from pathlib import Path
from typing import List, Dict, Union, Set
from itertools import combinations
from itertools import islice

import numpy as np
import pandas as pd
import scipy.sparse as sp
from joblib import dump
from minio import Minio
from tqdm import tqdm

from knodle.examples.data_preprocessing import get_mapping_rules_labels_t

## 1 Get the data

First of all, the file names for the output at the end of this notebook are defined. After that, the raw data can be downloaded from MinIO.

In [2]:
# define the files names
Z_MATRIX_TRAIN = "train_rule_matches_z.lib"
Z_MATRIX_DEV = "dev_rule_matches_z.lib"
Z_MATRIX_TEST = "test_rule_matches_z.lib"

T_MATRIX_TRAIN = "mapping_rules_labels_t.lib"

TRAIN_SAMPLES_OUTPUT = "df_train.lib"
DEV_SAMPLES_OUTPUT = "df_dev.lib"
TEST_SAMPLES_OUTPUT = "df_test.lib"

# file names for .csv files
TRAIN_SAMPLES_CSV = "df_train.csv"
DEV_SAMPLES_CSV = "df_dev.csv"
TEST_SAMPLES_CSV = "df_test.csv"

# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/police_killing"
os.makedirs(data_path, exist_ok=True)
os.path.join(data_path)

'../../../data_from_minio/police_killing'

In [None]:
client = Minio("knodle.cc", secure=False)
files = [
    "train.json", "test.json", 
    "FATAL ENCOUNTERS DOT ORG SPREADSHEET (See Read me tab).xlsx" 
    # I know the name is ugly but it's called exactly like this when downloading it from the website. 
    # I want to keep the name in case anyone downloads it from FE instead of Minio. Is that okay?
]
for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/police_killing/", file),
        file_path=os.path.join(data_path, file),
    )

### 1.1 Get the train data

We read the downloaded data and convert it to a Pandas Dataframe. For the train data, we take only the samples and the name of the potential victim and for the test data, we take the samples, the names and also the labels. In the samples, the name of the potential victim is replaced by the TARGET symbol. We rename this column to "samples".

In [3]:
def get_train_data(data_path: str) -> pd.DataFrame:
    with open(os.path.join(data_path, "train.json"), 'r') as data:
        train_data = [json.loads(line) for line in data]
    df_train_sent_alter = pd.DataFrame(train_data, columns = ["name", "sent_alter", "names_org"]).rename(columns={"sent_alter": "sample"})
    return df_train_sent_alter

df_train = get_train_data(data_path)
df_train.head()

Unnamed: 0,name,sample,names_org
0,Rodney Thomas,"Two years earlier , Officer TARGET was killed ...",[Rodney Thomas]
1,Howard Ave,Police Chief PERSON said Randolph was found sh...,"[Howard Ave, Howard Park Ave]"
2,Clayton Fernander,"In the latest incident , Chief Superintendent ...",[Clayton Fernander]
3,Richard Pickels,Chief TARGET of Penn Township police entered t...,[Richard Pickels]
4,Cathy Zuraw,A man was was fatally shot by a police officer...,[Cathy Zuraw]


Already at a first look at the samples we can see that the data is flawed. *TARGET* is supposed to replace the name of the victim, while the NE-Tagger Keith et al. used to prepare the data actually quite often identified the name of a police officer as *TARGET*. In this case, a sample should be labelled negative, even if it describes the killing of a civilian by the police.

### 1.2 Get the Dev and Test Data

Since the [SLANG Lab](http://slanglab.cs.umass.edu/PoliceKillingsExtraction/) provides only train and test data, but no development data, the part of the test data will be used as a development set. The samples for the development data will be selected randomly to avoid imbalances of positive and negative samples in dev and test data.

The parameter *used_as_dev* reflects the amount of the gold data that should be used for development instead of testing. It is set to 30% for now, but can be changed depending on the task definition.

In [4]:
used_as_dev = 30
print(f"{used_as_dev}% of the test data will be used for develoment.")

30% of the test data will be used for develoment.


In [5]:
def get_dev_test_data(data_path: str) -> Union[pd.DataFrame, pd.DataFrame]:
    with open(os.path.join(data_path, "test.json"), 'r') as data:
        dev_test_data = [json.loads(line) for line in data]
    dev_test_sent_alter = pd.DataFrame(dev_test_data, columns = ["name", "names_org", "sent_alter", "plabel"]).rename(columns={"sent_alter": "sample", "plabel": "label"})
    df_dev = dev_test_sent_alter.sample(n = int(round((dev_test_sent_alter.shape[0]/100)*used_as_dev))).reset_index(drop = True)
    df_test = dev_test_sent_alter.drop(df_dev.index).reset_index(drop = True)
    return df_dev, df_test

df_dev, df_test = get_dev_test_data(data_path)
df_test.head()

Unnamed: 0,name,names_org,sample,label
0,Joe Shuman,[Joe Shuman],[ ] TARGET / Chicago Tribune Lake County Major...,0
1,Joe Shuman,[Joe Shuman],Round Lake police shooting Round Lake police s...,0
2,Joe Shuman,[Joe Shuman],PERSON shooting PERSON shooting TARGET / Chica...,0
3,Joe Shuman,[Joe Shuman],Scene of Round Lake police shooting Scene of R...,0
4,Joe Shuman,[Joe Shuman],involved shooting TARGET / Chicago Tribune The...,0


### 1.3 Get Data of the Knowledge Base

The FE-Database contains information about people dying or experiencing violence in encounters with the police, including the name of the victim, a short description of the incident and the use of force (whether a person in an encounter was, for instance, killed by the police or committed suicide or died in an accident while the police was present). For us, only the entries about actual police killings (no suicide, no accidents etc.) are relevant.

It can be downloaded as an [Excel spreadsheet from Fatal Encounters](https://docs.google.com/spreadsheets/d/1dKmaV_JiWcG8XBoRgP8b4e9Eopkpgt7FL7nyspvzAsE/edit#gid=0). After that, we want to exclude all the information that is not useful for our task. 

We only need the first sheet that contains the actual information. In this sheet, we only need the name of the victim and the column *Intended use of force (Developing)*. We exclude all entries that do not describe *Deadly Force*.

In the end, we keep only the relevant names.

In [6]:
fe_database = pd.read_excel(os.path.join(data_path, "FATAL ENCOUNTERS DOT ORG SPREADSHEET (See Read me tab).xlsx"), sheet_name = 0)

In [7]:
fe_database.head()

Unnamed: 0,Unique ID,Name,Age,Gender,Race,Race with imputations,Imputation probability,URL of image (PLS NO HOTLINKS),Date of injury resulting in death (month/day/year),Location of injury (address),...,URL Temp,Brief description,"Dispositions/Exclusions INTERNAL USE, NOT FOR ANALYSIS",Intended use of force (Developing),Supporting document link,"Foreknowledge of mental illness? INTERNAL USE, NOT FOR ANALYSIS",Unnamed: 32,Unnamed: 33,Unique ID formula,Unique identifier (redundant)
0,25747.0,Mark A. Horton,21,Male,African-American/Black,African-American/Black,Not imputed,,2000-01-01,Davison Freeway,...,,Two Detroit men killed when their car crashed ...,Unreported,Pursuit,https://drive.google.com/file/d/1-nK-RohgiM-tZ...,No,,,,25747.0
1,25748.0,Phillip A. Blurbridge,19,Male,African-American/Black,African-American/Black,Not imputed,,2000-01-01,Davison Freeway,...,,Two Detroit men killed when their car crashed ...,Unreported,Pursuit,https://drive.google.com/file/d/1-nK-RohgiM-tZ...,No,,,,25748.0
2,25746.0,Samuel H. Knapp,17,Male,European-American/White,European-American/White,Not imputed,,2000-01-01,27898-27804 US-101,...,,Samuel Knapp was allegedly driving a stolen ve...,Unreported,Pursuit,https://drive.google.com/file/d/10DisrV8K5ReP1...,No,,,,25746.0
3,25749.0,Mark Ortiz,23,Male,Hispanic/Latino,Hispanic/Latino,Not imputed,,2000-01-01,600 W Cherry Ln,...,,A motorcycle was allegedly being driven errati...,Unreported,Pursuit,https://drive.google.com/file/d/1qAEefRjX_aTtC...,No,,,,25749.0
4,1.0,LaTanya Janelle McCoy,24,Female,African-American/Black,African-American/Black,Not imputed,,2000-01-02,5700 block Mack Road,...,,LaTanya Janelle McCoy's car was struck from be...,Unknown,Pursuit,http://www.recordnet.com/article/20000110/A_NE...,No,,,,1.0


In [7]:
fe_database = fe_database[['Name', 'Intended use of force (Developing)']]
fe_database = fe_database[fe_database['Intended use of force (Developing)'] == "Deadly force" ]
fe_database = fe_database.drop(['Intended use of force (Developing)'], axis=1).reset_index(drop = True).rename(columns={"Name": "name"}) 
fe_database.head()

Unnamed: 0,name
0,Lester Miller
1,Derrick E. Tate
2,John Edward Pittman
3,Kyle Dillon
4,Adrian Dolby


### 1.4 Some Statistics

In [8]:
# Count of samples
print(f"Number of samples:")
print(f"Train data: {df_train.shape[0]}")
print(f"Development data: {df_dev.shape[0]}")
print(f"Test data: {df_test.shape[0]}")

Number of samples:
Train data: 132833
Development data: 20678
Test data: 48247


In [9]:
# Positive and negative instances in dev and test data
positive_dev = df_dev.groupby("label").count()["sample"][1]
negative_dev = df_dev.groupby("label").count()["sample"][0]
positive_test = df_test.groupby("label").count()["sample"][1]
negative_test = df_test.groupby("label").count()["sample"][0]
print(f"In the develoment data, {positive_dev} ({(100/df_dev.shape[0])*positive_dev}%) instances are positive and {negative_dev} instances ({(100/df_dev.shape[0])*negative_dev}%) are negative.")
print(f"In the test data, {positive_test} ({(100/df_test.shape[0])*positive_test}%) instances are positive and {negative_test} instances ({(100/df_test.shape[0])*negative_test}%) are negative.")

In the develoment data, 4453 (21.534964696779188%) instances are positive and 16225 instances (78.46503530322082%) are negative.
In the test data, 8878 (18.40114411258731%) instances are positive and 39369 instances (81.5988558874127%) are negative.


### 1.5 Output classes

Our task is to find out whether a sentence describes the killing of a person by the police or does not. That means, it is a binary classification task with two output classes. The number of classes is defined with the *num_classes* parameter.

In [10]:
num_classes = 2

## 2 Get the Rules

For this task, the rules will be the names of people found in both our data and the FE-Database. A rule matches if a name found in the FE-Database can also be found in a sample.

### 2.1 Standardizing the Names

The problem when trying to match the FE-Database with our samples is that there might be different version of single names. For their dataset, Keith. et al used a standardized name version and kept all alternative names in a list in the column *names_org*. If might be possible that the FE-Database contains name versions which do not match with the standardized name of Keith et al., but with one of the versions in the *names_org*-list. (For instance, someone is called *Steiney James Richards Jr.* in the FE-Database and *James Richard* in our samples.)

We start solving this problem by mapping the standardized name in all three datasets to the different name versions. Then we can create a "reversed" dictionary containing all the different name versions as keys and the corresponding standardized name as value. This does not guarantee yet that we can find an exact name of the FE-Database in one of the name versions. Sometimes, a person has two forenames in the FE-Database, but only one forename is kept in the dataset of Keith et al. Therefore, we will also have to expand the list of names in the FE-Database. For persons whose name consists of more than two parts, we will add all different combinations of their name. 


After that, we can create the *intersection_fe_and_samples*-list, which contains the standardized names of people mentioned in both datasets and also considers all different versions. (We add a name—a value of the *names_org2names*-Dictionary—to the list if the name can be found in both the keys of the *names_org2names*-Dictionary and in the FE-Database.)


In [11]:
names2names_org = dict(zip(df_train["name"].to_list(), df_train["names_org"].to_list()))
names2names_org.update(dict(zip(df_dev["name"].to_list(), df_dev["names_org"].to_list())))
names2names_org.update(dict(zip(df_test["name"].to_list(), df_test["names_org"].to_list())))

In [12]:
names_org2names = {}

for name, names_org in names2names_org.items():
    for name_org in names_org: 
        names_org2names[name_org] = name

In [13]:
names_in_fe_database = fe_database["name"].to_list()
new_name_combinations = []

for name in names_in_fe_database: 
    name_parts = list(name.split())
    for i in range(len(name_parts)+1):
        for combination in combinations(name_parts, i): 
            if len(combination) > 1:
                new_name = ""
                for word in combination: 
                    new_name += word
                    new_name += " "  
                new_name = new_name[:-1]
                    
                new_name_combinations.append(new_name)
                

names_in_fe_database += new_name_combinations
names_in_fe_database = list(set(names_in_fe_database))

In [14]:
intersection_fe_and_samples = set([names_org2names[name_org] for name_org in names_org2names.keys() if name_org in names_in_fe_database])

In [15]:
print(f"{len(intersection_fe_and_samples)} of the people mentioned in the training, test and test samples can also be found in the FE-Database.")

2654 of the people mentioned in the training, test and test samples can also be found in the FE-Database.


### 2.2 Mapping Names to Rules

We assign all the names we can find in both datasets to a unique rule ID.

In [16]:
def get_rule2id(intersection_fe_and_samples: Set) -> Dict:
    rule2rule_id = {}
    rule_id = 0
    for name in intersection_fe_and_samples: 
        rule2rule_id[name] = rule_id
        rule_id += 1

    return rule2rule_id

rule2rule_id = get_rule2id(intersection_fe_and_samples)

In [17]:
print(f"There are {len(rule2rule_id)} rules.")
print("\nThe first rules of the rule2rule_id dictionary look like this:")
print(dict(islice(rule2rule_id.items(), 6)))

There are 2654 rules.

The first rules of the rule2rule_id dictionary look like this:
{'John Toles': 0, 'Martin Whittaker': 1, 'Ryan Stokes': 2, 'Javier Garcia': 3, 'Benjamin Ortiz': 4, 'David Lepine': 5}


Secondly, we create a dictionary assigning all rules to their label. There are only two classes (someone was killed by the police or was not killed by the police). Since there are no rules indicating that someone was **not** killed by the police, all rules indicate the positive class 1. Therefore, all values of the rule2label dictionary, containing the rule IDs as keys, can be set to 1.

**Actually, we don't even need these dicts anymore for this preprocesing. Do I have to keep them or can I just delete that?**
(it's just there because there are also rule2label and label2label_id in the other tutorials)

In [18]:
rule2label = {rule_id: 1 for rule_id in rule2rule_id.values()}

Thirdly, we create a label2label_id dictionary. As there are only two classes, this can be done manually.

In [19]:
label2label_id = {"negative":0, "positive":1}

## 3 Build the T matrix (rules x classes)

The rows of the T matrix are the rules and the columns the classes. The T matrix is one-hot encoded. (1 for a rule and its corresponding class.) It will be imported from the data_preprocessing folder of Knodle examples, since the same function can be used in several preprocessing tutorials. 

In [20]:
mapping_rules_labels_t = get_mapping_rules_labels_t(rule2label, num_classes)

## 4 Build the Z matrix (instances x rules)

### 4.1 Match Train Data with the rules

If an entity (a person) that can also be found in the *rule2rule_id*-Dictionary is mentioned in a sample, the name of this entity will be assigned to the rules column in the dataframe. The column enc_rules contains the rule ID corresponding to the name. We do not need the *names_org*-column anymore.

In [21]:
def get_df(data: pd.DataFrame, rule2rule_id: Dict) -> pd.DataFrame: 
    
    rules = [name if name in rule2rule_id.keys() else "" for name in data["name"].to_list()]
    enc_rules = [rule2rule_id[name] if name in rule2rule_id.keys() else "" for name in data["name"].to_list()]
    data["rule"] = rules
    data["enc_rule"] = enc_rules
    data = data.drop(['names_org'], axis=1).reset_index(drop = True)
    data = data.reset_index()
    
    return data

In [22]:
train_data = get_df(df_train, rule2rule_id)
train_data[8:20]

Unnamed: 0,index,name,sample,rule,enc_rule
8,8,Ray Rice,"The ad , which began airing in the Philadelphi...",Ray Rice,1079.0
9,9,Claire Darbyshire,Parkinson for Metro.co.ukThursday 10 Mar 2016 ...,,
10,10,Philando Castile,JPG Banners block the entrance gate as demonst...,Philando Castile,1890.0
11,11,Ty Money,"On November 25 , less than 24 hours after the ...",,
12,12,John Lee,"PERSON , 36 , was shot by SAPD Officer TARGET ...",,
13,13,Patrick Madden,( PERSON / Times Union ) less Troy Mayor TARGE...,,
14,14,Adrian Bankart,Mountain Rescue teams are probing reports of ...,,
15,15,Rahul Dravid,"No respite for Indian banks , but Mallya 's cr...",,
16,16,Brigid Collins,Fuller filed a motion for the Police Departmen...,,
17,17,Samaria Rice,TARGET had alleged that police failed to immed...,,


### 4.2 Match Dev and Test Data with the rules

Just as for the train data, we need a Dataframe with a sample, its corresponding rules, and the rule IDs for dev and test data. Moreover, we need the labels and the label IDs that we obtained earlier when reading the test data. The label ID is already present in the data, the label we add manually. 

In [24]:
def get_dev_test_df(rule2rule_id: Dict, data: pd.DataFrame) -> pd.DataFrame:

    dev_test_data = get_df(data, rule2rule_id)
    dev_test_data = dev_test_data.rename(columns={"label": "enc_label"})
    dev_test_data['label'] = np.where(dev_test_data['enc_label'] == 0, "negative", "positive")
    
    return dev_test_data

In [25]:
dev_data = get_dev_test_df(rule2rule_id, df_dev)
test_data = get_dev_test_df(rule2rule_id, df_test)

In [26]:
test_data.head()

Unnamed: 0,index,name,sample,enc_label,rule,enc_rule,label
0,0,Joe Shuman,[ ] TARGET / Chicago Tribune Lake County Major...,0,,,negative
1,1,Joe Shuman,Round Lake police shooting Round Lake police s...,0,,,negative
2,2,Joe Shuman,PERSON shooting PERSON shooting TARGET / Chica...,0,,,negative
3,3,Joe Shuman,Scene of Round Lake police shooting Scene of R...,0,,,negative
4,4,Joe Shuman,involved shooting TARGET / Chicago Tribune The...,0,,,negative


*(The test data is so confusing!
Some of them didn't actually get killed by the police. Harold Paniyak, for instance, commited suicide.)*

### 4.3 Convert Dataframes to (Sparse) Matrices

The train, test, and development data that we just stored as Pandas Dataframes should now be converted into a Scipy sparse matrix. The rows of the sparse matrix are the samples and the columns are the rules (i.e., a cell is 1 if the corresponding rule matches the corresponding sample, 0 otherwise). We initialize it as an array in the correct size (samples x rules), fill it with 1s and 0s, and convert it to a sparse matrix at the end.

In [29]:
def get_rule_matches_z_matrix(df: pd.DataFrame) -> sp.csr_matrix:

    z_array = np.zeros((len(df["index"].values), len(rule2rule_id)))

    for index in df["index"]:
        rule = df.iloc[index]['enc_rule']
        if rule != "":
            z_array[index][rule] = 1

    rule_matches_z_matrix_sparse = sp.csr_matrix(z_array)

    return rule_matches_z_matrix_sparse

In [30]:
train_rule_matches_z = get_rule_matches_z_matrix(train_data)
dev_rule_matches_z = get_rule_matches_z_matrix(dev_data)
test_rule_matches_z = get_rule_matches_z_matrix(test_data)

## 5 Save the Files

In [32]:
Path(os.path.join(data_path, "processed_kb")).mkdir(parents=True, exist_ok=True)

dump(sp.csr_matrix(mapping_rules_labels_t), os.path.join(data_path, "processed_kb", T_MATRIX_TRAIN))

dump(train_data["sample"], os.path.join(data_path, "processed_kb", TRAIN_SAMPLES_OUTPUT))
train_data["sample"].to_csv(os.path.join(data_path, "processed_kb", TRAIN_SAMPLES_CSV), header=True)
dump(train_rule_matches_z, os.path.join(data_path, "processed_kb", Z_MATRIX_TRAIN))

dump(dev_data[["sample", "label", "enc_label"]], os.path.join(data_path, "processed_kb", DEV_SAMPLES_OUTPUT))
dev_data[["sample", "label", "enc_label"]].to_csv(os.path.join(data_path, "processed_kb", DEV_SAMPLES_CSV), header=True)
dump(dev_rule_matches_z, os.path.join(data_path, "processed_kb", Z_MATRIX_DEV))

dump(test_data[["sample", "label", "enc_label"]], os.path.join(data_path, "processed_kb", TEST_SAMPLES_OUTPUT))
test_data[["sample", "label", "enc_label"]].to_csv(os.path.join(data_path, "processed_kb", TEST_SAMPLES_CSV), header=True)
dump(test_rule_matches_z, os.path.join(data_path, "processed_kb", Z_MATRIX_TEST))

['../../../data_from_minio/police_killing\\processed_kb\\test_rule_matches_z.lib']

## Rule Accuracy and Some Statistics

For the rule accuracy, we will compare the weak labels of the test data to the gold labels to check how reliable the rules are. 

In [34]:
positive_test_samples = test_data[test_data.enc_label == 1].shape[0]
negative_test_samples = test_data[test_data.enc_label == 0].shape[0]

In [36]:
true_positive = 0
true_negative = 0
false_positive = 0
false_negative = 0
matched_instances = test_data["enc_rule"].str.len() != 0

for row in range(test_data.shape[0]):
    if test_data.loc[row]["enc_label"] == 1: # the true label is 1
        if matched_instances[row]: # the predicted label is 1
            true_positive += 1
        else: # the predicted label is 0
            false_negative += 1
    else: # the true label is 0
        if matched_instances[row]: # the predicted label is 1
            false_positive += 1
        else: # the predicted label is 0
            true_negative += 1
                 
true_positive_percent = (100 / positive_test_samples) * true_positive
true_negative_percent = (100 / negative_test_samples) * true_negative

In [37]:
print(f"Out of {test_data.shape[0]} samples in the testdata, {positive_test_samples} samples are positive and {negative_test_samples} are negative.\n") 
print(f"By using only the rules to obtain weak labels, {true_positive_percent}% of all positive samples are matched by a rule and therefore labeled as positive. {true_negative_percent}% of all negative samples are correctly classified as negative.\n") 
print(f"True positives: {true_positive} \nTrue negatives: {true_negative} \nFalse positives: {false_positive} \nFalse negatives: {false_negative}")

Out of 48247 samples in the testdata, 8878 samples are positive and 39369 are negative.

By using only the rules to obtain weak labels, 93.19666591574679% of all positive samples are matched by a rule and therefore labeled as positive. 97.46247047169092% of all negative samples are correctly classified as negative.

True positives: 8274 
True negatives: 38370 
False positives: 999 
False negatives: 604


**F1-Score**

In [38]:
relevant = positive_test_samples
retrieved = true_positive + false_positive

prec = true_positive / retrieved
rec = true_positive / relevant

f1 = (2 * prec * rec) / (prec + rec)

print(f"F1-Score: {f1}")

F1-Score: 0.911685306594678


**Quick look at the false negatives - why are they not matched by a rule?**

In [40]:
test_data[(test_data["label"] == "positive") & (test_data["enc_rule"] == "")]

Unnamed: 0,index,name,sample,enc_label,rule,enc_rule,label
118,118,Jessica Nelson,"In addition , police have killed nine black wo...",1,,,positive
753,753,la Cruz,"Attorney TARGET , who represents the family , ...",1,,,positive
1133,1133,Richard Smith,Police identified the officers involved in the...,1,,,positive
1134,1134,Richard Smith,The two officers who shot Jester were identifi...,1,,,positive
1135,1135,Richard Smith,PERSON Father TARGET listens to speakers at a ...,1,,,positive
...,...,...,...,...,...,...,...
47930,47930,Justin White,Two other officers were killed in car crashes ...,1,,,positive
47931,47931,Justin White,Two other officers were killed in car crashes ...,1,,,positive
48089,48089,Frank Viggiano,trailer killed a fellow Linden officer TARGET ...,1,,,positive
48090,48090,Frank Viggiano,"Two of PERSON 's passengers were killed , amon...",1,,,positive


## Finish

Now our dataset is ready for the training.The rule accuracy is surprisingly high when we use the entries of the Fatal Encounters Database as rules! The F1-Score of 0.912 proves that our labels are already almost perfect. In the next step using Knodle we will see if we can even further improve them.