# Labelling Data

Labelling data is laborious. We spent months reading abstracts by hand.

```{figure} ../images/screenshot.png
:name: labelling

[NACSOS](https://doi.org/10.5281/zenodo.4121525)
```

Each document was coded by hand by at least two independent coders. All disagreements were resolved by discussion, if necessary involving a third coder.
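
As a rough illustration of the double-coding step described above, the sketch below flags the documents on which two coders disagree and reports their raw agreement rate. The data frame and the column names `coder_a` and `coder_b` are hypothetical and are not part of the dataset shipped with this tutorial.

```python
import pandas as pd

# Hypothetical double-coded decisions: one row per document, one column per coder
codes = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "coder_a": [1, 0, 1, 0],
    "coder_b": [1, 1, 1, 0],
})

# Documents where the two coders disagree would go to discussion
disagreements = codes[codes["coder_a"] != codes["coder_b"]]

# Share of documents on which both coders gave the same label
agreement_rate = (codes["coder_a"] == codes["coder_b"]).mean()

print(disagreements["id"].tolist(), agreement_rate)
```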

In [None]:
import os
os.chdir('../../../')

A sample of the documents we labelled is also included (for demonstration purposes, restricted to a subset of the most common labels).

In [None]:
import pandas as pd
labels = pd.read_feather('data/labels.feather')
print(labels.shape)
labels.head()

:::{attention}
Note that we treat the data in this tutorial as if the labelled documents were a representative sample of all documents. This is for simplicity's sake and for demonstration purposes. In fact, only some of the documents were drawn from a representative sample; those that were not were removed from our test sets.
:::

## Inclusion

`INCLUDE` is a binary label that takes the value 1 when a document was included (meaning that it deals with policy instruments of some sort).

In [None]:
import matplotlib.pyplot as plt
labels.groupby('INCLUDE')['id'].count().plot.bar()

## Policy Instrument Type

Policy instrument types are denoted by columns beginning with the prefix `4 -`. Taken together, they can be seen as a multilabel task (each document can be assigned zero or more policy instrument types).

In [None]:
instruments = [x for x in labels.columns if x.startswith("4 -")]
labels[instruments].sum().plot.bar()
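
To make the "zero or more" aspect of the multilabel task concrete, the sketch below counts how many instrument types each document carries. It uses a small toy frame with made-up column names (only the `4 -` prefix follows the convention above), so the numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for `labels`; the column names after the prefix are hypothetical
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "4 - Hypothetical type A": [1, 0, 1, 0],
    "4 - Hypothetical type B": [1, 0, 0, 0],
})
toy_instruments = [c for c in toy.columns if c.startswith("4 -")]

# Number of instrument types assigned to each document (row-wise sum)
n_types = toy[toy_instruments].sum(axis=1)

# How many documents have 0, 1, 2, ... instrument types
counts = n_types.value_counts().sort_index()
print(counts)
```

Applied to the real data, the same two lines with `labels[instruments]` would show how often documents carry no type, exactly one, or several.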

## Sector

Sectors are denoted by columns beginning with the prefix `8 -`. They can also be seen as a multilabel task (each document can be assigned zero or more sectors).

In [None]:
sectors = [x for x in labels.columns if x.startswith("8 -")]
labels[sectors].sum().plot.bar()

Relevant documents usually mention one or more specific instrument types in one or more sectors (the cross-sectoral label covers documents that simply discuss reducing emissions in general). The heatmap below counts the documents labelled with each combination of sector and instrument type.

In [None]:
import numpy as np
import seaborn as sns
# Count the documents labelled with each (sector, instrument type) pair
m = np.zeros((len(sectors), len(instruments)), dtype=int)
for i, sec in enumerate(sectors):
  for j, inst in enumerate(instruments):
    m[i, j] = labels[(labels[sec] == 1) & (labels[inst] == 1)].shape[0]
sns.heatmap(
  m,
  xticklabels=instruments,
  yticklabels=sectors,
  cmap='Blues',
  annot=True,
  fmt='d'  # show counts as integers rather than the default '.2g'
)
plt.show()
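
If the label columns contain only 0s and 1s, the same co-occurrence counts can also be obtained with a single matrix product instead of a nested loop. The sketch below checks this on a toy frame; the column names after the `4 -` and `8 -` prefixes are hypothetical, and the 0/1 encoding is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `labels`; all values are assumed to be 0 or 1
toy = pd.DataFrame({
    "4 - Hypothetical type A": [1, 0, 1, 1],
    "4 - Hypothetical type B": [0, 1, 1, 0],
    "8 - Hypothetical sector X": [1, 1, 0, 1],
    "8 - Hypothetical sector Y": [0, 0, 1, 1],
})
toy_instruments = [c for c in toy.columns if c.startswith("4 -")]
toy_sectors = [c for c in toy.columns if c.startswith("8 -")]

# (sectors x documents) @ (documents x instruments) = (sectors x instruments)
cooc = toy[toy_sectors].to_numpy().T @ toy[toy_instruments].to_numpy()

# Reference result from the explicit nested loop
loop = np.zeros((len(toy_sectors), len(toy_instruments)), dtype=int)
for i, sec in enumerate(toy_sectors):
    for j, inst in enumerate(toy_instruments):
        loop[i, j] = ((toy[sec] == 1) & (toy[inst] == 1)).sum()

print(cooc)
print(np.array_equal(cooc, loop))
```

The loop is easier to read, while the matrix product scales better when there are many label columns.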