# Information Extraction for Social Science Research

This tutorial will introduce you to *information extraction* for social science: techniques for turning documents into structured data by extracting specific words, phrases, or pieces of information from within documents.

----

Let's get started by installing `spaCy`, a library for natural language processing, and downloading the models and data we'll need for the tutorial.


In [1]:
!nvcc --version

!pip install --upgrade spacy

!pip install --upgrade spacy[cuda111,transformers]

!pip install jsonlines
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_sm

!wget https://andrewhalterman.com/files/cleaned_masdar.jsonl

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

## Getting started with NER and spaCy

In [2]:
import re

import jsonlines
from tqdm.autonotebook import tqdm

import spacy
from spacy import displacy
# assert spacy.__version__ == "3.1.3"

--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy-cuda111, cupy-cuda12x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------


`spaCy` requires a pretrained model to process a document. Here, we're using the "large" model trained on English language web and news text. `spaCy` has other models including a faster `en_core_web_sm` without pretrained embeddings and `en_core_web_trf`, a transformer-based model that is more accurate but which requires more storage and more time to run. We can also load the small model in case we want to compare the speed/accuracy tradeoff of the large and small models.


In [3]:
nlp = spacy.load("en_core_web_lg")
nlp_sm = spacy.load("en_core_web_sm")
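
To put a number on the speed side of the speed/accuracy tradeoff, one option is to time each pipeline over the same texts. The sketch below is a minimal illustration, not part of spaCy: `time_call` is a hypothetical helper, the two texts are made up, and `str.split` stands in for a pipeline so the example runs without any model. With the models loaded, you could pass `nlp` or `nlp_sm` instead.

```python
import time

def time_call(process, texts):
    """Return (results, seconds) for running `process` over every text."""
    start = time.perf_counter()
    results = [process(t) for t in texts]
    elapsed = time.perf_counter() - start
    return results, elapsed

# Stand-in for a pipeline so the sketch runs without a model;
# with spaCy loaded you could pass `nlp` or `nlp_sm` instead of `str.split`.
texts = ["Russian jets traveled to Aleppo.", "The ceasefire held overnight."]
results, seconds = time_call(str.split, texts)
print(len(results), f"{seconds:.6f}s")
```

Timings on two short strings are noisy, so a fair comparison would use a few hundred articles and the same sample for both models.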



Next, we'll load in a collection of news stories from a local pro-government newspaper in Syria called al-Masdar. The articles here primarily describe the civil war in Syria in 2016 and 2017.


In [4]:
with jsonlines.open("cleaned_masdar.jsonl", "r") as f:
    articles = list(f.iter())

print(len(articles))

7729


In [5]:
article = articles[313]
article

{'date': '13/02/2017',
 'body': 'BEIRUT, LEBANON (10:20 P.M.) – The Russian Air Force has launched several airstrikes over the eastern countryside of Aleppo tonight, hitting several of the Islamic State’s (ISIS) positions between Deir Hafer and Al-Bab.\nRussian jets traveled from the Hmaymim Military Airport in southwest Latakia to the Aleppo Governorate tonight in order to aid the Syrian and Turkish armies currently battling with the Islamic State forces in the Al-Bab Plateau and Deir Hafer Plain.\nAccording to local reports, Russian jets primarily focused on the\xa0\xa0road leading from Al-Bab to Deir Hafer; this area is where the Syrian Arab Army is currently attacking the Islamic State forces.\nRussian jets are still launching airstrikes this minute, forcing the Islamic State terrorists to to avoid launching counter-attacks against the Syrian and Turkish armies in east Aleppo.',
 'title': 'Russian jets hammer ISIS with nonstop airstrikes in east Aleppo'}

To process a document with `spaCy`, we'll use the `nlp` object we instantiated earlier and pass a piece of text to it. The `nlp` object returns a `Doc` object, which has both document-level and token-level attributes.

In [6]:
doc = nlp(article['body'])

In [7]:
# check how many tokens are in the document
len(doc)

158

In [8]:
# look at the document-level attributes
dir(doc)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_context',
 '_get_array_attrs',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set

In [9]:
# tokens in a document can be accessed by their index:
print(doc[5])
dir(doc[5])

P.M.


['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

One of the attributes `spaCy` assigns is named entity information for the document. Using spaCy's built-in visualizer, we can see all the detected named entities in the document.

In [10]:
displacy.render(doc, style="ent", jupyter=True)

TIP: to look up what a label returned by spaCy means, you can use the `spacy.explain()` function. So, for example,

In [11]:
spacy.explain("GPE")

'Countries, cities, states'

We can also customize displaCy's output, for example by showing only certain entity types and setting custom colors:

In [12]:
# show only ORG and GPE entities, with a gradient color for ORG
options = {"ents": ["ORG", "GPE"], "colors": {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}}
# a simpler alternative with a solid color:
# options = {"ents": ["ORG", "GPE"], "colors": {"ORG": "red"}}


In [13]:
displacy.render(doc, style="ent", jupyter=True, options=options)

- Can you spot an error in the NER results?
- What if you use the small model instead of the large model? (How would you do that?)
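
One way to approach the second question is to collect each model's entities as `(text, label)` pairs and diff the two sets. The sketch below is only illustrative: `diff_entities` is a hypothetical helper, not part of spaCy, and the two example sets are made up rather than real model output. With real documents you could build each set with something like `{(ent.text, ent.label_) for ent in nlp(text).ents}`, and the same for `nlp_sm`.

```python
def diff_entities(ents_a, ents_b):
    """Compare two collections of (text, label) entity pairs."""
    a, b = set(ents_a), set(ents_b)
    return {
        "both": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }

# Made-up stand-ins for the output of a large and a small model.
large_ents = {("Aleppo", "GPE"), ("Islamic State", "ORG"), ("Al-Bab", "GPE")}
small_ents = {("Aleppo", "GPE"), ("Islamic State", "ORG"), ("Al-Bab", "PERSON")}

result = diff_entities(large_ents, small_ents)
for key, pairs in result.items():
    print(key, pairs)
```

Comparing on `(text, label)` pairs means a span with the same text but a different label shows up on both "only" lists, which makes label disagreements easy to spot.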

Let's get our documents processing in the background using `spaCy`'s efficient `nlp.pipe` method and then turn to some theory and applications.

In [14]:
just_text = [i['body'] for i in articles]
docs = list(tqdm(nlp.pipe(just_text), total=len(just_text)))

  0%|          | 0/7729 [00:00<?, ?it/s]

In [15]:
len(docs)

7729

## NER Applications

What kinds of questions can we answer with NER and how does this fit in with our research?

The simplest questions are descriptive ones, especially questions that are useful at the beginning of a research project or when a researcher would like to better understand the contents of a corpus.

One of the simplest questions someone can ask is: which people, organizations, and locations are mentioned most?

- **Question**: How could this be useful in research?

As an example, let's identify which organizations are mentioned most in our corpus.

At this point our documents should all be processed, so we can collect every `ORG` entity and count the most frequent ones:

In [None]:
from collections import Counter

all_orgs = []
for d in docs:
    orgs = [ent.text for ent in d.ents if ent.label_ == "ORG"]
    all_orgs.extend(orgs)

Counter(all_orgs).most_common(15)

**Questions**

- What does this tell us substantively about the conflict?
- How could the results be more useful? What's missing that would make them more insightful?

### Another NER example: ceasefires and organizations

A simple extension is to study which organizations are mentioned alongside certain keywords. In practice, this would probably involve looking at organizations alongside document classifications or topics, but we can use keywords as a rough approximation here.


**Exercise**: Which organizations are mentioned most alongside mentions of "ceasefires" or "negotiations"?

In [21]:
# write code here

# Hint: ent.sent.text will return the text of the sentence where entity `ent` is mentioned

# one possible solution:
negotiation_orgs = []
for d in docs:
    for ent in d.ents:
        if ent.label_ != "ORG":
            continue
        if re.search("negotiat|ceasefire|talks", ent.sent.text):
            negotiation_orgs.append(ent.text)
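
The pattern in the solution above is case-sensitive, so it would miss a sentence that begins with "Negotiations" or "Ceasefire", and the bare string `talks` would also match inside a word like "stalks". The following is a hedged variant, not the tutorial's method: `KEYWORDS` and `mentions_negotiation` are hypothetical names and the example sentences are made up. Swapping it into the loop would change the counts the original pattern produces.

```python
import re

# Case-insensitive; \b keeps "talks" from matching inside "stalks".
KEYWORDS = re.compile(r"\b(negotiat\w*|ceasefires?|talks)\b", re.IGNORECASE)

def mentions_negotiation(sentence):
    """True if the sentence contains one of the negotiation keywords."""
    return KEYWORDS.search(sentence) is not None

examples = [
    "Negotiations resumed in Geneva on Monday.",
    "The ceasefire collapsed after two days.",
    "The army stalks the retreating convoy.",
]
for s in examples:
    print(mentions_negotiation(s), s)
```

In the loop, the test would become `if mentions_negotiation(ent.sent.text):`.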

In [23]:
from collections import Counter
Counter(negotiation_orgs).most_common(10)

[('UN', 214),
 ('the United Nations Special Envoy', 159),
 ('Staffan de Mistura', 159),
 ('Alfano', 107),
 ('United Nations', 107),
 ('the Syrian Opposition', 97),
 ('The Syrian Arab Army', 1),
 ('SAA', 1)]
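
The counts above split what is arguably one actor across several surface forms ("UN" and "United Nations", or "The Syrian Arab Army" and "SAA"). Here is a minimal sketch of one way to merge them, assuming a small hand-written alias table. The `ALIASES` mapping and the `normalize_org` helper are hypothetical, and deciding which strings refer to the same organization is a research decision that spaCy does not make for you.

```python
from collections import Counter

# Hypothetical alias table: keys are lowercased surface forms.
ALIASES = {
    "un": "United Nations",
    "united nations": "United Nations",
    "saa": "Syrian Arab Army",
    "syrian arab army": "Syrian Arab Army",
}

def normalize_org(name):
    """Strip a leading 'the' and map known aliases to one canonical name."""
    cleaned = name.strip()
    if cleaned.lower().startswith("the "):
        cleaned = cleaned[4:]
    return ALIASES.get(cleaned.lower(), cleaned)

# A few surface forms taken from the output above (counts not reproduced).
mentions = ["UN", "United Nations", "The Syrian Arab Army", "SAA", "UN"]
merged = Counter(normalize_org(m) for m in mentions)
print(merged.most_common())
```

Names that are not in the table pass through unchanged apart from the stripped article, so the table can be grown incrementally as new variants turn up.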

## Dependency parses

Named entity recognition is useful for identifying named entities in isolation or in the context of other terms or concepts. NER on its own tells us little about the relationships between named entities. Often, the relationship between entities is the interesting piece of information for applied researchers, and we can get at that relationship by using the grammar of the sentence.


Dependency parses are a way of representing the syntax or grammar of a sentence. For example, a dependency parse might identify that a particular word is a noun, and specifically that it is the subject noun of a sentence.

While this isn't strictly speaking information extraction (although it is structured prediction), having access to a dependency parse can be very valuable in extracting information from documents.

First, let's look at how a dependency parse encodes grammatical information by using spaCy's dependency visualizer.

In [24]:
doc = nlp(articles[313]['body'])

In [25]:
sent = list(doc.sents)[1]

In [26]:
displacy.render(sent, style="dep", jupyter=True)

You can think about dependency parses as a greatly enhanced form of part-of-speech tagging. While part-of-speech tagging assigns labels to individual words, like "Russian" being an ADJ[ective], "jets" being a NOUN, and "traveled" being a VERB (the labels below each word), dependency parsing goes a step further and tells you that the noun "jets" is specifically the subject noun ("nsubj") of the verb "traveled", and that "Russian" is not only an adjective, but specifically an adjective that is modifying the word "jets".

The implementation of dependency parsers is beyond the scope of this tutorial. Dependency parsing is a more complicated task than named entity recognition, given that a model needs to infer a tree structure that is subject to constraints (e.g. each word can only have a single "head" word immediately above it in the tree), and also needs to predict the correct label for each relationship. A useful list of different implementations is available at [Papers With Code](https://paperswithcode.com/task/dependency-parsing).
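
To make the single-head constraint concrete, here is a pure-Python sketch that stores a parse as one head index per word and checks that it forms a tree. The three-word parse is written by hand for illustration (it is not spaCy output), and `ancestors` and `is_valid_tree` are hypothetical helpers. Following spaCy's convention, the root is represented as a token whose head is itself.

```python
# Each token: (word, index of its head, dependency label).
# Hand-written parse of "Russian jets traveled"; the root points to itself.
parse = [
    ("Russian", 1, "amod"),    # modifies "jets"
    ("jets", 2, "nsubj"),      # subject of "traveled"
    ("traveled", 2, "ROOT"),
]

def ancestors(parse, i):
    """Walk from token i up to the root, returning the indices visited."""
    path = []
    seen = {i}
    while parse[i][1] != i:
        i = parse[i][1]
        if i in seen:  # a cycle means the structure is not a tree
            raise ValueError("cycle detected")
        seen.add(i)
        path.append(i)
    return path

def is_valid_tree(parse):
    """True if there is exactly one root and every token reaches it."""
    roots = [i for i, (_, head, _) in enumerate(parse) if head == i]
    if len(roots) != 1:
        return False
    try:
        for i in range(len(parse)):
            ancestors(parse, i)
    except ValueError:
        return False
    return True

print(is_valid_tree(parse))
print([parse[i][0] for i in ancestors(parse, 0)])
```

Because each token stores exactly one head index, the "single head" rule is built into the representation. What remains to check is that there is one root and no cycles.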

### Example information extraction with dependency parses

On its own, a dependency parse doesn't give you the ability to extract information from documents. That said, the information within a dependency parse can help you build a rule-based approach to extracting information.

One thing we might want to be able to extract from text is generally what kinds of behaviors or actions are occurring in a particular location. Let's write a function to identify verbs + direct objects that are grammatically linked to a location.

In [27]:
print(doc)

BEIRUT, LEBANON (10:20 P.M.) – The Russian Air Force has launched several airstrikes over the eastern countryside of Aleppo tonight, hitting several of the Islamic State’s (ISIS) positions between Deir Hafer and Al-Bab.
Russian jets traveled from the Hmaymim Military Airport in southwest Latakia to the Aleppo Governorate tonight in order to aid the Syrian and Turkish armies currently battling with the Islamic State forces in the Al-Bab Plateau and Deir Hafer Plain.
According to local reports, Russian jets primarily focused on the  road leading from Al-Bab to Deir Hafer; this area is where the Syrian Arab Army is currently attacking the Islamic State forces.
Russian jets are still launching airstrikes this minute, forcing the Islamic State terrorists to to avoid launching counter-attacks against the Syrian and Turkish armies in east Aleppo.


In [28]:
tok = doc[21]  # "Aleppo"
print(tok)

Aleppo


In [29]:
def loc_to_verb(tok):
    verb_phrase = []
    # first, iterate through all the ancestors of the token
    for i in tok.ancestors:
        # when you get to a verb (using a POS tag)...
        if i.pos_ == "VERB":
            # ...add the verb to the verb phrase list
            verb_phrase.append(i)
            # then, also add the direct object(s) of the verb, as long as the original token
            # is in the same subtree as the direct object
            verb_phrase.extend([j for j in i.children if j.dep_ == "dobj" and tok in i.subtree])
            # we only want the first verb, so stop after we find one
            break
    # expand out the verb phrase to get modifiers ("amod") of the direct object
    for i in verb_phrase:
        for j in i.children:
            if j.dep_ == "amod":
                verb_phrase.append(j)

    # sort the tokens by their position in the original sentence
    new_list = sorted(verb_phrase, key=lambda x: x.i)
    # join them together with the correct whitespace and return
    return ''.join([i.text_with_ws for i in new_list]).strip()

loc_to_verb(tok)

'launched several airstrikes'

We can then use our function to identify all the actions related to a single city, Aleppo.

In [30]:
aleppo_actions = []

for d in docs:
    for i in d:
        if i.text == "Aleppo":
            aleppo_actions.append(loc_to_verb(i))

sorted(list(set(aleppo_actions)))

['',
 'According',
 'added',
 'announced',
 'approached rebel position',
 'arrange new agreement',
 'await official reports',
 'beheaded youngster',
 'captured',
 'captured number',
 'captured villages',
 'carried fire',
 'carried strikes Turab',
 'conducted',
 'continued',
 'continued advance',
 'continued powerful assault',
 'continuedassault',
 'controlled City',
 'defending rebel assault',
 'destroyed large depot',
 'distributed',
 'downed',
 'erupted',
 'fight way',
 'fighting',
 'fighting Syrian government',
 'finish study',
 'firing several rockets',
 'hit hospital',
 'holds Aleppo',
 'killed',
 'launched massive offensives',
 'launched several airstrikes',
 'launchedwidescaleoffensive',
 'launching airstrikes',
 'launching counter-attacks',
 'let assault',
 'located',
 'looked',
 'massacring civilians',
 'met garrison',
 'operating',
 'positioned',
 'pounding main road',
 'poured',
 'rained held districts',
 'recaptured area',
 'redeployed',
 'rein Forces',
 'remained',
 'renew
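
Deduplicating with `set` shows which phrases occur but hides how often each one appears, and it keeps the empty string returned when no verb is found. Below is a minimal sketch of a follow-up step, assuming a list shaped like `aleppo_actions`. The short `sample_actions` list is made up for illustration, and `count_actions` is a hypothetical helper.

```python
from collections import Counter

def count_actions(actions):
    """Count extracted phrases, ignoring empty results."""
    return Counter(a for a in actions if a)

# Made-up stand-in for `aleppo_actions`.
sample_actions = [
    "launched several airstrikes",
    "",
    "captured villages",
    "launched several airstrikes",
    "captured",
]

counts = count_actions(sample_actions)
print(counts.most_common(2))
```

On the real list, `count_actions(aleppo_actions).most_common(20)` would rank the extracted phrases by frequency.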

## Summary: Extracting Information with Rules and Dependency Parses

What we've done is an example of rule-based information extraction using dependency parses as a component in our rule-based system. An alternative approach to rule-based extraction is to train a machine learning model to extract the information you're looking for. Machine learning models are often more accurate and less sensitive to small changes in language than rule-based systems are. Rule-based systems have their place, though. Researchers do not need to annotate large amounts of data to create rule-based systems, and it's easier to understand why a system returned the answer it did than is often the case with machine learning systems.

# Extensions and Experiments

### Transformer-based spaCy model

How does the accuracy change if you use the model with pretrained embeddings (`en_core_web_lg`) or the transformer-based model (`en_core_web_trf`)? If you use the transformer model, you'll probably want to change your runtime to GPU, which will require you to reinstall the libraries you installed at the beginning of the notebook.

In [33]:
!python -m spacy download en_core_web_trf


In [34]:
nlp_trf = spacy.load("en_core_web_trf")

ValueError: [E002] Can't find factory for 'curated_transformer' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, en.lemmatizer

In [35]:
doc = nlp_trf(articles[313]['body'])
displacy.render(doc, style="ent", jupyter=True)

NameError: name 'nlp_trf' is not defined