# 1. Getting the Data

First we will scrape policies from the gov.ie website.

In your command line, ``cd`` into this repository.

``cd`` into the ``policy_scraping`` task directory, then ``cd`` again into the ``policy_scraping`` scrapy environment.

In [1]:
import os
cwd = os.getcwd() # should be base directory of repository
os.chdir(cwd+"/policy_scraping/policy_scraping")

Run ``scrapy crawl goviefor -O ../outputs/goviefor.json`` (change the ``-O`` argument if you would prefer a different output file; note that ``-O`` overwrites any existing file at that path).

This command generates a JSON file containing the metadata for all the policies and downloads all the policy files to the same outputs directory, under ``forestry/full``.
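Before moving on, it can help to sanity-check the scraper output. The sketch below assumes the spider uses Scrapy's standard files pipeline, which by default records each download under a ``files`` field as dicts with a ``path`` key; the helper name ``summarize_scrape``, the ``title`` field, and the sample items are hypothetical, and the real metadata fields may differ.

```python
def summarize_scrape(items, files_field="files"):
    """Count scraped items and the downloaded-file paths they reference.

    Assumes each item is a dict and that downloads are listed under
    `files_field` as dicts with a "path" key (the files pipeline default).
    """
    paths = [
        entry["path"]
        for item in items
        for entry in item.get(files_field, [])
        if "path" in entry
    ]
    # Items with an empty or absent files field had nothing downloaded.
    missing = sum(1 for item in items if not item.get(files_field))
    return {"items": len(items), "files": len(paths), "items_without_files": missing}


# Hypothetical items, for illustration only.
sample_items = [
    {"title": "Policy A", "files": [{"path": "full/aaa.pdf"}]},
    {"title": "Policy B", "files": []},
]
summary = summarize_scrape(sample_items)
print(summary)
```

On the real output you would pass in the list loaded with ``json.load`` from the scraper's output file instead of ``sample_items``.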

In [None]:
!! scrapy crawl goviefor -O ../outputs/goviefor.json

Next we will consolidate the metadata and text of the policy PDFs into one dictionary.

In [2]:
os.chdir(cwd) # back to base directory
import json
from populate_corpora.pdfs_to_jsons import scrp_itm_to_fulltxt
FILE_DIR= cwd+"/policy_scraping/policy_scraping/outputs" # or whatever output directory you gave the scraper for its output json

In [None]:
with open(cwd+"/policy_scraping/outputs/goviefor.json","r", encoding="utf-8") as f:
    metad = json.load(f)
pdf_dict = scrp_itm_to_fulltxt(metad, FILE_DIR+"/forestry/full")

If you have your own collection of pdfs to process and don't have a metadata file, you can use this next function on just the file directory.

In [None]:
from populate_corpora.pdfs_to_jsons import pdfs_to_txt_dct
pdf_dict = pdfs_to_txt_dct(FILE_DIR+"/forestry/full") # or whatever your policy directory is

For the purposes of this project, we only need the text of the PDFs as cleaned sentences. So we'll extract and clean those sentences, then load them into the dictionary format that doccano (our labeling platform) uses. Finally, if we want, we can use a simple keyword search to prelabel some of the sentences with an "incentive class mention" label.

In [None]:
import nltk
from populate_corpora.data_cleaning import get_clean_text_sents, format_sents_for_doccano, prelabeling
EN_TOKENIZER = nltk.data.load("tokenizers/punkt/english.pickle") # need tokenizer for our text cleaning
clean_sents= get_clean_text_sents(pdf_dict, EN_TOKENIZER)
doccano_dict = format_sents_for_doccano(clean_sents)
prelab_doccano_dict = prelabeling(doccano_dict)
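The keyword prelabeling idea can be pictured with the short sketch below. It is not the repository's ``prelabeling`` implementation: the keyword list, the helper name ``keyword_prelabel``, and the ``{"text": ..., "label": [...]}`` entry shape (borrowed from entries built later in this notebook) are all assumptions.

```python
import re

# Hypothetical keywords; the project's own list may differ.
KEYWORDS = ["grant", "payment", "tax relief", "penalty"]
MENTION_LABEL = "incentive class mention"


def keyword_prelabel(entries, keywords=KEYWORDS, label=MENTION_LABEL):
    """Return copies of the entries, adding `label` where a keyword matches."""
    # Word boundaries stop "grant" from matching inside "granted".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    labeled = []
    for entry in entries:
        labels = list(entry.get("label", []))
        if pattern.search(entry["text"]) and label not in labels:
            labels.append(label)
        labeled.append({**entry, "label": labels})
    return labeled


entries = [
    {"text": "A grant of 500 euro is paid per hectare.", "label": []},
    {"text": "The scheme opened in 2015.", "label": []},
]
prelabeled = keyword_prelabel(entries)
```

Returning copies rather than editing in place keeps the unlabeled entries available if you want to rerun with a different keyword list.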

Now we can save this dictionary as a JSON file to import into our doccano instance for labeling.

In [None]:
with open(cwd+"/populate_corpora/outputs/ready_to_label.json", 'w', encoding="utf-8") as outfile:
    json.dump(prelab_doccano_dict, outfile, ensure_ascii=False, indent=4)

# 2. Labeling the Data

## Annotator

We used a doccano instance for our labeling, but we also had to do some data validation with an external annotator. This section generates a subset for a labeler from the hand-labeled dataset.

In [3]:
from populate_corpora.annotators import label_dct, resample_dict
from populate_corpora.data_cleaning import dcno_to_sentlab
with open(cwd+"/inputs/19Jan25_firstdatarev.json","r", encoding="utf-8") as f: #our hand-labeled dataset
    dcno_json = json.load(f)

sents_d, labels_d = dcno_to_sentlab(dcno_json)
label_lib = label_dct({"text":sents_d[i], "label":[labels_d[i]]} for i in range(len(sents_d)))
resampled = resample_dict(label_lib)
ann_frame = [{'text':sent, 'label':[]} for key in resampled.keys() for sent in resampled[key]]

with open(cwd+"/inputs/subsample_to_label.json", 'w', encoding="utf-8") as outfile:
    json.dump(ann_frame, outfile, ensure_ascii=False, indent=4)
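One way to picture drawing the annotator subset is a capped random sample per label, so that rare classes are not swamped by the majority class. This is only a sketch under that assumption and is not necessarily how ``resample_dict`` behaves; ``sample_per_label``, ``max_per_label``, and ``seed`` are hypothetical names.

```python
import random


def sample_per_label(label_to_sents, max_per_label=10, seed=0):
    """Draw up to `max_per_label` sentences per label, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return {
        label: rng.sample(sents, min(max_per_label, len(sents)))
        for label, sents in label_to_sents.items()
    }


# Toy data, for illustration only.
toy = {
    "Non-Incentive": [f"sentence {i}" for i in range(50)],
    "Fine": ["fine sentence 1", "fine sentence 2"],
}
subset = sample_per_label(toy, max_per_label=5)
```

Labels with fewer sentences than the cap are kept whole, which is why ``Fine`` retains both of its toy sentences here.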

Now let's check the inter-annotator agreement.

In [4]:
with open(cwd+"/inputs/annotation_odon.json","r", encoding="utf-8") as f: # the external annotator's labels
    ann_json = json.load(f)

sents_a, labels_a = dcno_to_sentlab(ann_json)
# map the annotator's label names onto ours; entries with an unrecognized label are skipped
swap_labs = {'non-incentive':'Non-Incentive', 'fine':'Fine', 'tax deduction':'Tax_deduction', 'credit':'Credit', 'direct payment':'Direct_payment', 'supplies':'Supplies', 'technical assistance':'Technical_assistance'}
sents_a2, labels_a2 = [], []
for i, lab in enumerate(labels_a):
    try:
        labels_a2.append(swap_labs[lab])
        sents_a2.append(sents_a[i])
    except KeyError: # label not in swap_labs
        pass

In [7]:
from populate_corpora.annotators import get_common_sentlabs, all_to_bin, all_to_sharedmc
from sklearn.metrics import cohen_kappa_score

s_sents, labels_sc, labels_sa = get_common_sentlabs(sents_d, labels_d, sents_a2, labels_a2)
print(f"All: {cohen_kappa_score(labels_sc, labels_sa)} for {len(labels_sc)} entries")

labs_binc, labs_bina = all_to_bin(labels_sc), all_to_bin(labels_sa)
print(f"Binary: {cohen_kappa_score(labs_binc, labs_bina)} for {len(labs_binc)} entries")

mclabsc, mclaba = all_to_sharedmc(labels_sc, labels_sa, labs_binc, labs_bina)
print(f"Multiclass: {cohen_kappa_score(mclabsc, mclaba)} for {len(mclabsc)} entries")

All: 0.7707100591715976 for 62 entries
Binary: 0.7114788004136505 for 62 entries
Multiclass: 0.9534883720930233 for 26 entries
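To make the three scores easier to interpret, here is a from-scratch sketch of Cohen's kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The helpers ``to_binary`` and ``shared_incentives`` reflect an assumed reading of ``all_to_bin`` and ``all_to_sharedmc`` (collapse all incentive classes into one; keep only entries both annotators marked as an incentive), which would be consistent with the multiclass score covering fewer entries. The labels below are toy data, not project data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's label frequencies.
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def to_binary(labels, negative="Non-Incentive"):
    """Collapse every incentive class into a single 'Incentive' label."""
    return [negative if lab == negative else "Incentive" for lab in labels]


def shared_incentives(labels_a, labels_b, negative="Non-Incentive"):
    """Keep only the pairs that both annotators marked as some incentive."""
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a != negative and b != negative]
    return [a for a, _ in pairs], [b for _, b in pairs]


ours = ["Fine", "Credit", "Non-Incentive", "Non-Incentive", "Fine", "Credit"]
theirs = ["Fine", "Credit", "Non-Incentive", "Fine", "Credit", "Credit"]

kappa_all = cohen_kappa(ours, theirs)
kappa_bin = cohen_kappa(to_binary(ours), to_binary(theirs))
mc_ours, mc_theirs = shared_incentives(ours, theirs)
kappa_mc = cohen_kappa(mc_ours, mc_theirs)
```

Note that this sketch does not guard against the degenerate case where p_e equals 1 (both annotators using a single identical label throughout).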


## Augmentation

We also need to make a new human-in-the-loop dataset by running sentence similarity searches with predefined queries. We have five queries for each label.
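As a rough picture of what a similarity search with a threshold does, the sketch below scores toy vectors against a query by cosine similarity. It assumes cosine similarity and an inclusive threshold, which may not match ``run_queries`` exactly; the parameter names ``sim_thresh`` and ``res_lim`` mirror the arguments used in this notebook, while ``cosine``, ``search``, and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def search(query_vec, sent_vecs, sentences, sim_thresh=0.5, res_lim=1000):
    """Return up to `res_lim` (sentence, score) hits at or above `sim_thresh`."""
    scored = [(sent, cosine(query_vec, vec))
              for sent, vec in zip(sentences, sent_vecs)]
    hits = [pair for pair in scored if pair[1] >= sim_thresh]
    hits.sort(key=lambda pair: pair[1], reverse=True)  # best matches first
    return hits[:res_lim]


# Toy 2-d "embeddings"; a real run would use sentence-embedding vectors.
sentences = ["grants for planting", "annual report", "payments to farmers"]
sent_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
hits = search([1.0, 0.0], sent_vecs, sentences)
```

Raising ``sim_thresh`` trades recall for precision: fewer sentences come back, but they sit closer to the query.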

In [None]:
with open(cwd+"/populate_corpora/outputs/ready_to_label.json","r", encoding="utf-8") as f:
    prelab_doccano_dict = json.load(f)

In [3]:
from populate_corpora.query_augment import run_embedder, run_queries, QUERIES_DCT
from populate_corpora.data_cleaning import dcno_to_only_sents

# loading all sentences, not just the labeled ones
# or reload cwd+"/populate_corpora/outputs/ready_to_label.json"
all_sents = dcno_to_only_sents(prelab_doccano_dict) 
embs, s_sentences, model = run_embedder(sample=False, dev='cuda', data=all_sents, unique=True)
# uses our queries dictionary, but you can supply your own
qry_dct = run_queries(embs, s_sentences, model, qry_dct=QUERIES_DCT, dev='cuda', sim_thresh=0.5, res_lim=1000)

Now we'll parse the results and create a dataset of sentences labeled by the query process. First, though, we filter the results so that each label keeps only the sentences found by at least 4 of its 5 queries.

In [7]:
from populate_corpora.query_augment import consolidate_sents, crossref_sents
lbl_qry_dct = consolidate_sents(qry_dct, QUERIES_DCT)
filt_qry_dct = crossref_sents(lbl_qry_dct, 4)
qry_rs_dataset = [{'text': sent, 'label': lbl} for lbl in list(filt_qry_dct) for sent in filt_qry_dct[lbl]]
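The "at least 4 of 5 queries" filter can be sketched with a simple counter. This illustrates the idea only and is not the code behind ``crossref_sents``; ``keep_frequent``, ``min_hits``, and the toy result lists are hypothetical.

```python
from collections import Counter


def keep_frequent(sents_per_query, min_hits=4):
    """Keep sentences returned by at least `min_hits` of the queries."""
    # set() ensures a sentence counts once per query, even if repeated.
    counts = Counter(s for sents in sents_per_query for s in set(sents))
    return sorted(s for s, c in counts.items() if c >= min_hits)


# Five toy result lists for a single label, one per query.
results = [
    ["s1", "s2"],
    ["s1", "s3"],
    ["s1", "s2"],
    ["s1", "s2", "s2"],
    ["s2", "s3"],
]
kept = keep_frequent(results)
```

Here ``s1`` and ``s2`` each appear in four of the five result lists and are kept, while ``s3`` appears in only two and is dropped.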

In [10]:
with open(cwd+"/populate_corpora/outputs/augmented_to_label.json", 'w', encoding="utf-8") as outfile:
    json.dump(qry_rs_dataset, outfile, ensure_ascii=False, indent=4)

# 3. Fine-Tuning the Model