# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [1]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import the library

After installing it, we import stanza into our notebook.

In [2]:
import stanza



## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [3]:
stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Use the corpus in the portfolio repo
Instead of using the corpus texts in the “session_10.1” folder, your script should use
the (much larger) corpus folder in your fork of the portfolio repository.

In [4]:
!git clone https://github.com/jafar756/FASDH25-portfolio2.git


Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4381, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 4381 (delta 9), reused 13 (delta 7), pack-reused 4363 (from 2)
Receiving objects: 100% (4381/4381), 17.81 MiB | 15.16 MiB/s, done.
Resolving deltas: 100% (15/15), done.


## Extract only the place names from the articles written in January 2024
Use a condition to make sure that only place names from articles written in January
2024 are extracted.

In [None]:
# Import the os module to work with the file system
import os

# Path to the folder where the article text files are stored
corpus_path = "/content/FASDH25-portfolio2/articles"

# Empty list to collect the paths of matching files
jan_files = []

# Walk through the directory and its subdirectories
for root, _, files in os.walk(corpus_path):
    # Loop through all files in the current directory
    for file in files:
        # Keep .txt files whose name contains '2024-01' (January 2024)
        if '2024-01' in file and file.endswith('.txt'):
            jan_files.append(os.path.join(root, file))

# Print the total number of matching files found
print(f"Found {len(jan_files)} January 2024 files.")


Found 326 January 2024 files.
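
The check `'2024-01' in file` matches that substring anywhere in the file name. If the file names begin with an ISO date (an assumption; the naming scheme of the corpus is not shown here), a stricter test could parse the date instead. The helper name `is_january_2024` and the sample file names below are hypothetical; this is a minimal sketch, not the method used in the cell above.

```python
import re
from datetime import date

# Assumption: file names start with an ISO date, e.g. "2024-01-15_1234.txt"
DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")

def is_january_2024(filename):
    """Return True if a .txt file name starts with a valid date in January 2024."""
    match = DATE_PREFIX.match(filename)
    if not match or not filename.endswith(".txt"):
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        parsed = date(year, month, day)  # Rejects impossible dates such as 2024-01-40
    except ValueError:
        return False
    return parsed.year == 2024 and parsed.month == 1

# Hypothetical file names used only to illustrate the check
sample_names = [
    "2024-01-15_1234.txt",
    "2023-12-31_99.txt",
    "id-2024-01.txt",
    "2024-01-40_1.txt",
]
january_names = [name for name in sample_names if is_january_2024(name)]
print(january_names)
```

In the loop above, the condition could then read `if is_january_2024(file):`, provided the assumption about the file names holds for the corpus.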


## Place names: count the number of times each named entity that refers to a place is mentioned in these texts

In [6]:
import stanza
from collections import Counter

# Download the English models and build the NLP pipeline
# (this only needs to run once per session)
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

# Counter to keep track of place name frequencies
place_counter = Counter()

# Process each January 2024 article
for filepath in jan_files:
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
        doc = nlp(text)
        for ent in doc.ents:
            if ent.type in ['GPE', 'LOC']:  # GPE = geopolitical entities, LOC = location
                place_name = ent.text.strip()
                place_counter[place_name] += 1

# Print a sample of extracted place names and their counts
for place, count in place_counter.most_common(20):
    print(f"{place}: {count}")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


Gaza: 1605
Israel: 1593
US: 706
Iran: 206
South Africa: 200
the Red Sea: 194
Yemen: 182
Lebanon: 175
Palestine: 124
the Gaza Strip: 123
West Bank: 120
the United States: 97
UK: 95
Beirut: 84
Syria: 83
the Middle East: 77
Qatar: 64
Iraq: 62
Washington: 60
Red Sea: 50


## Counting place names

In [7]:
from collections import Counter

# Counter to tally place name mentions
place_counter = Counter()

# Loop through each January 2024 file
for filepath in jan_files:
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
        doc = nlp(text)  # Process the text with stanza

        # Loop through the named entities found in the text
        for ent in doc.ents:
            if ent.type in ['GPE', 'LOC']:  # GPE = geopolitical entity, LOC = location
                place = ent.text.strip()
                place_counter[place] += 1  # Count the place name

# Show the 20 most frequently mentioned places
for place, count in place_counter.most_common(20):
    print(f"{place}: {count}")


Gaza: 1605
Israel: 1593
US: 706
Iran: 206
South Africa: 200
the Red Sea: 194
Yemen: 182
Lebanon: 175
Palestine: 124
the Gaza Strip: 123
West Bank: 120
the United States: 97
UK: 95
Beirut: 84
Syria: 83
the Middle East: 77
Qatar: 64
Iraq: 62
Washington: 60
Red Sea: 50
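
The counting step appears in two cells, so it may be convenient to wrap it in a function. The sketch below assumes only that each entity has `.text` and `.type` attributes, as the entities in `doc.ents` do in the cells above. The function name `count_places` is hypothetical, and the `SimpleNamespace` objects are stand-ins so that the example runs without loading a stanza model.

```python
from collections import Counter
from types import SimpleNamespace

# Entity types treated as places: geopolitical entities and locations
PLACE_TYPES = {"GPE", "LOC"}

def count_places(entities, counter=None):
    """Add the text of every GPE or LOC entity to a Counter and return it."""
    counter = Counter() if counter is None else counter
    for ent in entities:
        if ent.type in PLACE_TYPES:
            counter[ent.text.strip()] += 1
    return counter

# Stand-ins for stanza entities: only the .text and .type attributes are used
fake_entities = [
    SimpleNamespace(text="Gaza", type="GPE"),
    SimpleNamespace(text="the Red Sea", type="LOC"),
    SimpleNamespace(text="Gaza", type="GPE"),
    SimpleNamespace(text="Reuters", type="ORG"),
]
print(count_places(fake_entities).most_common())
```

Inside the file loop, the call would be `count_places(doc.ents, place_counter)`, which keeps adding to the same counter across articles.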


## Clean up the named entity names

Check whether the data contains duplicates and merge them using conditions: for example, add the count for “Gaza’s” to “Gaza” and remove “Gaza’s” from the dictionary.

In [8]:
clean_place_counter = Counter()

for place, count in place_counter.items():
    # Remove the possessive "'s" (curly or straight apostrophe) and surrounding spaces
    cleaned_name = place.replace("’s", "").replace("'s", "").strip()

    # Normalize case; note that title() also turns acronyms such as "US" into "Us"
    cleaned_name = cleaned_name.title()

    # Add the count to the cleaned name, merging variants that now share a name
    clean_place_counter[cleaned_name] += count

# Display the 20 most frequent cleaned place names
for place, count in clean_place_counter.most_common(20):
    print(f"{place}: {count}")


Israel: 1625
Gaza: 1623
Us: 706
Iran: 209
South Africa: 208
The Red Sea: 199
Yemen: 188
Lebanon: 178
The Gaza Strip: 125
Palestine: 124
West Bank: 122
The United States: 118
Uk: 95
Beirut: 87
Syria: 84
The Middle East: 77
Qatar: 65
Iraq: 64
Washington: 62
Tel Aviv: 51
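
The cleaned counts still contain some variants: `title()` rewrites "US" as "Us", and names with and without a leading article ("The United States", "United States") stay separate. The sketch below shows one possible alternative that keeps the original case, strips a possessive only at the end of a name, drops a leading "the", and maps a few abbreviations to a full name. The names `normalize_place`, `merge_counts`, `ALIASES`, and `KEEP_ARTICLE` are hypothetical, the alias table is illustrative rather than complete, and the sample counts are made up for the demonstration.

```python
from collections import Counter

# Assumption: a small hand-made alias table; it would need to be extended for a real corpus
ALIASES = {
    "US": "United States",
    "U.S.": "United States",
    "UK": "United Kingdom",
}

# Names in which the article is part of the name and should be kept
KEEP_ARTICLE = {"the hague"}

def normalize_place(name):
    """Return a canonical form of a place name without changing its case."""
    cleaned = name.strip()
    # Strip a possessive ending only when it occurs at the end of the name
    for suffix in ("’s", "'s"):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
    # Drop a leading article so that "the Red Sea" and "Red Sea" merge
    if cleaned.lower().startswith("the ") and cleaned.lower() not in KEEP_ARTICLE:
        cleaned = cleaned[4:]
    return ALIASES.get(cleaned, cleaned)

def merge_counts(counts):
    """Sum the counts of all names that share a canonical form."""
    merged = Counter()
    for name, count in counts.items():
        merged[normalize_place(name)] += count
    return merged

# Made-up counts used only to illustrate the merging
sample = Counter({
    "Gaza": 5, "Gaza’s": 2,
    "the Red Sea": 3, "Red Sea": 1,
    "US": 4, "the United States": 2,
})
print(merge_counts(sample).most_common())
```

Whether "US" and "the United States" should count as one place is a judgment call for the analysis; the alias table makes that decision explicit and easy to revise.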


### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "placename" and "count".
We'll create the tsv file in two steps:

1. we create the header: that is, the column names, separated by tabs
2. we loop through all the place names, and we create a new row in the table for each place. Each row contains the place name and its count, separated by a tab. Each row has to start on a new line, so we add a newline character \n at the end of every row, including the header.

The cell below implements both steps:

In [9]:
# Path of the file in which to save the place names and their counts
output_path = "/content/ner_counts.tsv"

# Open the file in write mode ('w') with UTF-8 encoding to handle special characters
with open(output_path, 'w', encoding='utf-8') as f:
    # Write the header row, which labels the two columns: 'placename' and 'count'
    f.write("placename\tcount\n")

    # Go through all the cleaned place names, starting with the most common
    for place, count in clean_place_counter.most_common():
        # Write each place name and its count, separated by a tab
        f.write(f"{place}\t{count}\n")

# Print a message to confirm that the file was saved
print(f"Saved place name counts to: {output_path}")

Saved place name counts to: /content/ner_counts.tsv
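
As an alternative to building each row by hand, Python's standard `csv` module can write the tabs and newlines. The sketch below writes to an in-memory buffer so that it runs anywhere; the function name `write_counts_tsv` and the example counts are hypothetical. With `lineterminator="\n"`, every row, including the header, ends with exactly one newline.

```python
import csv
import io
from collections import Counter

def write_counts_tsv(counts, handle):
    """Write a header row and one "placename<TAB>count" row per place."""
    # lineterminator="\n" ends every row, including the header, with one newline
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["placename", "count"])
    for place, count in counts.most_common():
        writer.writerow([place, count])

# Made-up counts used only to illustrate the output format
example_counts = Counter({"Gaza": 3, "Israel": 2})
buffer = io.StringIO()
write_counts_tsv(example_counts, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open(output_path, "w", encoding="utf-8", newline="")`, as the `csv` documentation recommends passing `newline=""`.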


The file is now stored in the Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/ner_counts.tsv`.

In [10]:
# Open the file from the path where it's stored
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
    print(file.read())

placename	count
Israel	1625
Gaza	1623
Us	706
Iran	209
South Africa	208
The Red Sea	199
Yemen	188
Lebanon	178
The Gaza Strip	125
Palestine	124
West Bank	122
The United States	118
Uk	95
Beirut	87
Syria	84
The Middle East	77
Qatar	65
Iraq	64
Washington	62
Tel Aviv	51
Red Sea	50
India	50
Ukraine	47
Egypt	44
Jordan	43
Russia	43
Canada	42
The West Bank	40
Rafah	40
United States	40
Saudi Arabia	39
Gaza Strip	34
The Hague	33
Gaza City	32
Germany	31
The United Kingdom	31
China	30
Europe	30
Africa	29
Jerusalem	26
Turkey	25
Middle East	25
Tehran	25
Ramallah	24
West	24
Pakistan	24
East Jerusalem	23
Khan Younis	23
The Gulf Of Aden	23
Doha	19
Jenin	19
Asia	18
London	17
Damascus	17
Belgium	16
Sanaa	15
The United Arab Emirates	14
Netherlands	14
France	14
Deir El-Balah	14
Dc	14
Strip	14
Britain	14
Morocco	14
Australia	13
Erbil	13
United Kingdom	12
Uganda	12
The Cape Of Good Hope	12
Dearborn	12
Michigan	12
Mediterranean	12
Iowa	12
Akrotiri	12
Jabalia	11
Norway	11
U.S.	11
Bahrain	11
Nuseirat	11
Hebron	10

Now, reuse the code above to get the coordinates for the place names from the places we stored in the `ner_counts.tsv` file.

Write a new tsv file, `ner_gazetteer.tsv`, which contains three columns: name, latitude, longitude.
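
The file handling for this step can be sketched as follows. The geocoding code itself is not shown in this section, so the sketch takes a `lookup` function as a parameter and uses a small hand-made dictionary in its place. The names `build_gazetteer` and `KNOWN_COORDINATES` are hypothetical, the coordinates are approximate values included only for illustration, and the example works in a temporary folder rather than in `/content`.

```python
import csv
import os
import tempfile

def build_gazetteer(counts_path, gazetteer_path, lookup):
    """Read place names from counts_path and write name, latitude, longitude rows."""
    with open(counts_path, encoding="utf-8", newline="") as source, \
         open(gazetteer_path, "w", encoding="utf-8", newline="") as target:
        reader = csv.DictReader(source, delimiter="\t")
        writer = csv.writer(target, delimiter="\t", lineterminator="\n")
        writer.writerow(["name", "latitude", "longitude"])
        for row in reader:
            coordinates = lookup(row["placename"])
            if coordinates is None:
                # Leave the coordinate columns empty when no match is found
                writer.writerow([row["placename"], "", ""])
            else:
                latitude, longitude = coordinates
                writer.writerow([row["placename"], latitude, longitude])

# Assumption: a hand-made table stands in for the real geocoding step
KNOWN_COORDINATES = {"Beirut": (33.89, 35.50), "Doha": (25.29, 51.53)}

with tempfile.TemporaryDirectory() as folder:
    counts_path = os.path.join(folder, "ner_counts.tsv")
    gazetteer_path = os.path.join(folder, "ner_gazetteer.tsv")
    # A three-row excerpt in the same format as ner_counts.tsv
    with open(counts_path, "w", encoding="utf-8") as f:
        f.write("placename\tcount\nBeirut\t87\nDoha\t19\nStrip\t14\n")
    build_gazetteer(counts_path, gazetteer_path, KNOWN_COORDINATES.get)
    with open(gazetteer_path, encoding="utf-8") as f:
        gazetteer_text = f.read()

print(gazetteer_text)
```

Rows without coordinates, such as the fragment "Strip", keep empty latitude and longitude columns, which makes it easy to spot names that need manual checking.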