# Using stanza for Named Entity Recognition


## Installation

Run the code cell below to install stanza:

In [1]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import libraries

In [2]:
import stanza
import os

## Creating the pipeline

Download the English language model and build the pipeline. We restrict it to three processors: tokenization (`tokenize`), multi-word token expansion (`mwt`) and Named Entity Recognition (`ner`):


In [3]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Cloning the repository


In [4]:

!git clone https://github.com/kulsoom-za/FASDH25-portfolio2.git



Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4388, done.
remote: Counting objects: 100% (4388/4388), done.
remote: Compressing objects: 100% (4371/4371), done.
remote: Total 4388 (delta 22), reused 4378 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (4388/4388), 17.83 MiB | 21.41 MiB/s, done.
Resolving deltas: 100% (22/22), done.


## Filter only January 2024 articles

In [5]:
# Set the path to the folder containing article files
path = "/content/FASDH25-portfolio2/articles"
# List all files in the folder
files = os.listdir(path)
# Keep only Jan 2024 articles
jan_files = [f for f in files if f.startswith("2024-01-")]
# Show how many were found
print("January files found:", len(jan_files))


January files found: 326
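
Note that `os.listdir` returns names in arbitrary order, so `jan_files[0]` is not guaranteed to be the same article in every session. A minimal sketch of a sorted variant of the filter, assuming the filenames start with an ISO date (`YYYY-MM-DD`), as the `2024-01-` prefix suggests. The helper name `filter_by_prefix` and the sample filenames are hypothetical:

```python
def filter_by_prefix(filenames, prefix):
    """Return the filenames that start with `prefix`, sorted alphabetically."""
    return sorted(f for f in filenames if f.startswith(prefix))

# Hypothetical filenames standing in for the result of os.listdir(path)
sample_files = [
    "2024-01-15_0002.txt",
    "2023-12-31_0001.txt",
    "2024-01-02_0001.txt",
    "2024-02-01_0001.txt",
]

jan_sample = filter_by_prefix(sample_files, "2024-01-")
print(jan_sample)
```

Because ISO dates sort chronologically as plain strings, the first element of the sorted list would be the earliest January article.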


### Place names
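Use a loop to print only the named entities that are place names. In the OntoNotes tag set used by this model, these are the types `GPE` (geopolitical entities such as countries and cities) and `LOC` (other locations such as mountain ranges and bodies of water). The cell below processes only the first January article in the list.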

If you don't remember how to do that, look back at last week's notebook!

In [6]:
# check if there are any January files in the folder before proceeding
if jan_files:

    # open the first article in the list of January files
    with open(os.path.join(path, jan_files[0]), encoding="utf-8") as file:

        # read the full text of the article into a variable called 'text'
        text = file.read()

        # pass the article text through the stanza NLP pipeline to analyze it
        doc = nlp(text)

        # loop through the named entities found in the document
        for e in doc.entities:

            # only print entities that are places
            if e.type in ["GPE", "LOC"]:
                print(e)



{
  "text": "Israel",
  "type": "GPE",
  "start_char": 52,
  "end_char": 58
}
{
  "text": "Jerusalem",
  "type": "GPE",
  "start_char": 151,
  "end_char": 160
}
{
  "text": "Haifa",
  "type": "GPE",
  "start_char": 164,
  "end_char": 169
}
{
  "text": "Israel",
  "type": "GPE",
  "start_char": 586,
  "end_char": 592
}
{
  "text": "Gaza",
  "type": "GPE",
  "start_char": 619,
  "end_char": 623
}
{
  "text": "Haifa",
  "type": "GPE",
  "start_char": 1013,
  "end_char": 1018
}
{
  "text": "Israel",
  "type": "GPE",
  "start_char": 1172,
  "end_char": 1178
}
{
  "text": "Israel",
  "type": "GPE",
  "start_char": 3162,
  "end_char": 3168
}
{
  "text": "Palestine",
  "type": "GPE",
  "start_char": 3560,
  "end_char": 3569
}
{
  "text": "Gaza",
  "type": "GPE",
  "start_char": 4906,
  "end_char": 4910
}
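
The `start_char` and `end_char` values in this output are character offsets into the article text: counting starts at 0 and the end offset is exclusive, so slicing the text with them returns the entity. A small sketch using the first line of the article and the first entity printed above (the dictionary is a stand-in for the stanza entity object):

```python
# First line of the article, as shown in the notebook output
text = "Reporter’s Notebook: Covering an antiwar protest in Israel"

# Stand-in for the first entity printed above
entity = {"text": "Israel", "type": "GPE", "start_char": 52, "end_char": 58}

# Slice the text with the offsets to recover the entity string
span = text[entity["start_char"]:entity["end_char"]]
print(span)  # Israel
```

This is handy when you want to show an entity in context, for example by slicing a few characters before `start_char` and after `end_char`.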


### Counting place names

We can now use a dictionary to count how many times each place is mentioned in the text, as we did with regular expressions:

In [7]:
# create an empty dictionary
places = {}

# loop through the entities:
for e in doc.entities:
  # add a condition so that only place names are processed:
  if e.type in ["GPE", "LOC"]:
    # add the count to the dictionary:
    places[e.text] = places.get(e.text, 0) + 1

print(places)


{'Israel': 4, 'Jerusalem': 1, 'Haifa': 2, 'Gaza': 2, 'Palestine': 1}
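
As an alternative to the `dict.get` pattern, Python's standard library offers `collections.Counter`, which does the same counting and can also rank the results. A sketch under stated assumptions: the `(text, type)` pairs below are copied from the output of the place-name cell and stand in for `doc.entities`, and the `PERSON` entry is a hypothetical addition to show that the filter excludes non-places:

```python
from collections import Counter

# (text, type) pairs standing in for doc.entities
entities = [
    ("Israel", "GPE"), ("Jerusalem", "GPE"), ("Haifa", "GPE"),
    ("Israel", "GPE"), ("Gaza", "GPE"), ("Haifa", "GPE"),
    ("Israel", "GPE"), ("Israel", "GPE"), ("Palestine", "GPE"),
    ("Gaza", "GPE"),
    ("Stefanie", "PERSON"),  # hypothetical non-place entity
]

place_types = {"GPE", "LOC"}

# Count only the entities whose type is a place type
places = Counter(text for text, ent_type in entities if ent_type in place_types)

# The single most frequent place
print(places.most_common(1))  # [('Israel', 4)]
```

With real stanza output, the generator would read `e.text for e in doc.entities if e.type in place_types` instead.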


### Extract and Clean Place Names

In [8]:
# Extract and clean geopolitical entities (GPE)
place_counts = {}  # Initialize place_counts dictionary before using it

for e in doc.ents:
    if e.type == "GPE":
        name = e.text

        # Remove possessives
        if name.endswith("’s") or name.endswith("'s"):
            name = name[:-2]

        # Clean punctuation and whitespace
        name = name.strip(" ,.!?;:\n\t")

        # Update count, initializing if not seen before
        place_counts[name] = place_counts.get(name, 0) + 1

# Print the full article text for reference (the cleaned counts are in place_counts)
print(doc.text)

Reporter’s Notebook: Covering an antiwar protest in Israel

-----

It’s a crisp sunny Saturday morning as our crew prepares the car for the drive from Jerusalem to Haifa to cover an antiwar rally. Spirits are high as I place my camera equipment in the boot of the car. Then we discuss footwear.
Stefanie, our correspondent, has chosen to wear comfortable white trainers, expecting the likelihood of violence to be low. However, Luke, whom we’ve hired to provide security, and I have plumped for sturdy boots in case things get heated.
This is the first antiwar protest to take place in Israel since it began its war on Gaza following the Hamas attacks of October 7.
Since, it hasn’t been easy for the antiwar voice to make itself heard. The organisers of this rally, Hadash, a left-wing socialist party that supports a two-state solution, were initially banned from gathering and had to take their request to the Supreme Court.
For us, even finding the protest location proves difficult. As we near t
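
The cleaning steps in the cell above can also be wrapped in a small function so they are easy to test on their own. This is a sketch, not the notebook's exact logic: the function name `clean_place_name` is hypothetical, and it strips punctuation before removing the possessive, so that a name like `Gaza's,` is also handled:

```python
def clean_place_name(name):
    """Strip surrounding punctuation and whitespace, then a trailing possessive."""
    name = name.strip(" ,.!?;:\n\t")
    # Handle both the typographic and the straight apostrophe
    for suffix in ("’s", "'s"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name

# Hypothetical raw entity strings
raw_names = ["Israel’s", "Gaza's,", " Haifa.", "Jerusalem"]
cleaned = [clean_place_name(n) for n in raw_names]
print(cleaned)  # ['Israel', 'Gaza', 'Haifa', 'Jerusalem']
```

One caveat applies to this sketch and to the original cell alike: stripping `.` also shortens abbreviations, so `U.S.` becomes `U.S`.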

### Storing data in a TSV file


In [9]:
# Define the name and path of the output file
filename = "/content/FASDH25-portfolio2/ner_counts.tsv"

# open the file in writing mode using UTF-8 encoding
with open(filename, mode="w", encoding="utf-8") as file:
    # create the header line: column names separated by a tab
    header = "Place\tCount\n"
    # write the header to the file
    file.write(header)

    # now loop through the place_counts dictionary (cleaned place names and their counts)
    for place, count in place_counts.items():
        # create a row with the place and count separated by a tab
        row = f"{place}\t{count}\n"
        # write the row to the file
        file.write(row)


The file is now stored in the Colab session's file system, inside the cloned repository folder. To see it, click the folder icon in Colab's left-hand toolbar and open `FASDH25-portfolio2`. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/FASDH25-portfolio2/ner_counts.tsv`:

In [10]:
with open("/content/FASDH25-portfolio2/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

Place	Count
Israel	4
Jerusalem	1
Haifa	2
Gaza	2
Palestine	1
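
The same file can also be written and read with Python's built-in `csv` module, which takes care of quoting if a place name ever contains a tab or a line break. A minimal sketch, assuming the counts shown above; it writes to a temporary folder (a hypothetical location) rather than to the repository:

```python
import csv
import os
import tempfile

place_counts = {"Israel": 4, "Jerusalem": 1, "Haifa": 2, "Gaza": 2, "Palestine": 1}

# Hypothetical output location: a temporary folder instead of the repository
out_path = os.path.join(tempfile.mkdtemp(), "ner_counts.tsv")

# Write the header and one row per place, separated by tabs
with open(out_path, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter="\t")
    writer.writerow(["Place", "Count"])
    writer.writerows(place_counts.items())

# Read the file back into a dictionary, converting counts to integers
with open(out_path, encoding="utf-8", newline="") as file:
    reader = csv.DictReader(file, delimiter="\t")
    loaded = {row["Place"]: int(row["Count"]) for row in reader}

print(loaded)
```

Reading the file back and comparing it with the original dictionary is a quick way to confirm that nothing was lost when saving.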

