# Using stanza for Named Entity Recognition


## Installation

Run the code cell below to install stanza:

In [1]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import libraries

In [2]:
import stanza

## Creating the pipeline

Download the English language model and build the pipeline. The `processors` argument restricts the pipeline to tokenization, multi-word token expansion (`mwt`) and Named Entity Recognition (`ner`):


In [3]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Cloning the repository


In [4]:
# Clone the FASDH25-portfolio2 repository into the Colab session
!git clone https://github.com/kulsoom-za/FASDH25-portfolio2.git

Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4505, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 4505 (delta 26), reused 18 (delta 18), pack-reused 4468 (from 2)
Receiving objects: 100% (4505/4505), 19.72 MiB | 34.52 MiB/s, done.
Resolving deltas: 100% (76/76), done.


## Filter only Jan 2024 articles

In [5]:
import os

# Create a dictionary to store place name counts
places = {}

# Folder path to your cloned repository's articles
folder = "/content/FASDH25-portfolio2/articles"

# Initialize a counter to see how many articles from January 2024 are found
jan_2024_article_count = 0

# Loop through every file in the articles folder
for filename in os.listdir(folder):
    # Check if the filename contains "2024-01-"
    if "2024-01-" in filename:
        # If yes, increase the article counter
        jan_2024_article_count += 1
        # Create a path to the file:
        path = f"{folder}/{filename}"
        # Open and read the file:
        with open(path, encoding="utf-8") as file:
            text = file.read()
        # Use the nlp pipeline to analyse the text:
        doc = nlp(text)
        # Loop through all named entities found in the text
        for e in doc.entities:
            # Check if the entity is a geopolitical entity (GPE) or a location (LOC)
            if e.type in ["GPE", "LOC"]:
                # Add the count for that place name to the places dictionary:
                # if the place already exists, increase its count by 1, otherwise initialize it with 1
                places[e.text] = places.get(e.text, 0) + 1

# Print total number of January 2024 articles
print("Total number of articles:", jan_2024_article_count)
# Print the dictionary of place names and their counts
print(places)


Total number of articles: 326
{'Morocco': 13, 'Israel': 1593, 'Gaza': 1605, 'Rabat': 3, 'United States': 40, 'the United Arab Emirates': 13, 'UAE': 7, 'Bahrain': 11, 'Sudan': 3, 'US': 706, 'Western Sahara': 3, 'Washington': 60, 'Tel Aviv': 49, 'Algeria': 7, 'Marrakesh': 1, 'the Western Sahara': 1, 'Morocco’s': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 120, 'Dena': 1, 'Israel’s': 31, 'Oakland': 1, 'the United States': 97, 'South Africa': 200, 'Jordan': 42, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 43, 'Qatar': 64, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 124, 'Indonesia’s': 1, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 3, 'UK': 95, 'Manchester': 1, 'Yemen': 182, 'Washington, DC': 4, 'India': 50, 'Hyderabad': 1, 'Colombo’s Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 2, 'Iran': 206, 'Kerman': 6, 'Lebanon': 175, 'Bethlehem': 4
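
The raw dictionary is easier to scan once it is ranked. The following is a minimal sketch, not part of the original notebook, that uses `collections.Counter` on a handful of counts copied from the output above (the variable names `sample_places`, `ranked` and `us_variants` are illustrative). It also shows why the cleaning step in the next section is needed: spelling variants of the same place are still counted separately at this stage.

```python
from collections import Counter

# A few raw counts copied from the output above (the full dictionary is longer)
sample_places = {
    "Israel": 1593,
    "Gaza": 1605,
    "US": 706,
    "the United States": 97,
    "United States": 40,
    "Yemen": 182,
}

# Counter accepts an existing dict of counts and can rank its entries
ranked = Counter(sample_places).most_common(3)
for place, count in ranked:
    print(f"{place}\t{count}")

# Variants of the same place are still separate keys before cleaning
us_variants = ["US", "the United States", "United States"]
us_total = sum(sample_places[name] for name in us_variants)
print("United States variants combined:", us_total)
```

In this sample, the three variants of "United States" together would outrank "Yemen" by a wide margin, which the unmerged ranking hides.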

## Extract and Clean Place Names

In [6]:
# Import library to use regular expressions to clean the text
import re
# Idea taken from the second chat response in 6.CHATGPT; the dictionary itself was written manually
standard_names = {
    # Region normalizations (e.g., all Gaza-related names convert to "Gaza")
    'gaza': 'Gaza',
    'gaza city': 'Gaza',

    # Country abbreviations and alternate names
    'us': 'United States',
    'u.s.': 'United States',
    'usa': 'United States',
    'uk': 'United Kingdom',
    'uae': 'United Arab Emirates',
    'britain': 'United Kingdom',

    # Official names mapped to common usage
    'state of israel': 'Israel',
    'islamic republic of iran': 'Iran',
    'republic of yemen': 'Yemen',
    'state of palestine': 'Palestine',

    # Common misspellings and corrections
    'beruit': 'Beirut',
    'dahiyeb': 'Dahiyeh',
    'tel israel': 'Tel Aviv',

    # Sub-region normalizations
    'westbank': 'West Bank',
}

# Function to normalize a place name to a standard form
def normalize_place(place):
    # Remove possessives, punctuation, and "the"
    place = re.sub(r"[’'`]s\b", "", place.strip())
    place = re.sub(r"[^\w\s]", "", place)
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # Make the place name lowercase to match easily
    normalized_input = place.lower()

    # Check if this version exists among the keys of standard_names
    for key in standard_names:
        if normalized_input == key.lower():
            return standard_names[key].strip()  # Return the corrected name

    # If not found, return the name in title case (like "gaza strip" to "Gaza Strip")
    return place.strip().title()

# Create a new dictionary to store cleaned place names and their counts
cleaned_places = {}

# Go through each place and its count in the original dictionary
for place, count in places.items():
    normalized = normalize_place(place) # Clean the place name
    # Use lower case to avoid duplicates
    normalized_key = normalized.lower()
    # Add the count to the "cleaned_places" dictionary
    # If the place is already there, add to its count; if not start the count
    cleaned_places[normalized_key] = cleaned_places.get(normalized_key, 0) + count
# print the final cleaned and counted place names
print(cleaned_places)

{'morocco': 14, 'israel': 1632, 'gaza': 1655, 'rabat': 3, 'united states': 879, 'united arab emirates': 21, 'bahrain': 11, 'sudan': 3, 'western sahara': 4, 'washington': 62, 'tel aviv': 52, 'algeria': 7, 'marrakesh': 1, 'maghreb': 1, 'ukraine': 47, 'saudi arabia': 39, 'california': 3, 'west bank': 164, 'dena': 1, 'oakland': 1, 'south africa': 208, 'jordan': 43, 'jerusalem': 26, 'east jerusalem': 23, 'egypt': 44, 'qatar': 65, 'kuala lumpur': 4, 'malaysia': 8, 'palestine': 125, 'indonesia': 3, 'jakarta': 2, 'johannesburg': 4, 'london': 17, 'paris': 8, 'vienna': 1, 'berlin': 5, 'amman': 6, 'washington dc': 7, 'united kingdom': 152, 'manchester': 1, 'yemen': 189, 'india': 50, 'hyderabad': 1, 'colombo kollupitiya': 1, 'namibia': 10, 'germany': 31, 'palestinian territories': 1, 'sweden': 3, 'iran': 210, 'kerman': 6, 'lebanon': 178, 'bethlehem': 4, 'nairoukh': 1, 'china': 30, 'italy': 10, 'spain': 7, 'turkey': 25, 'shawawra': 1, 'hague': 39, 'gaza strip': 160, 'khan younis': 23, 'syria': 84, 
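
To see how the cleaning merges variants, here is a small self-contained sketch. It uses a condensed copy of `normalize_place` (a direct dictionary lookup instead of the loop) and only a subset of `standard_names`, both simplifications made for illustration. The raw counts are copied from the January 2024 output above; `raw_counts` and `merged` are hypothetical names.

```python
import re

# Subset of the mapping above, for illustration only
standard_names = {
    "us": "United States",
    "uk": "United Kingdom",
    "uae": "United Arab Emirates",
}

def normalize_place(place):
    place = re.sub(r"[’'`]s\b", "", place.strip())  # drop possessive 's
    place = re.sub(r"[^\w\s]", "", place)  # drop remaining punctuation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # drop leading "the"
    # Look up the lowercase form; fall back to title case if it is not mapped
    return standard_names.get(place.lower(), place.strip().title())

# Raw counts copied from the earlier output
raw_counts = {
    "US": 706,
    "the United States": 97,
    "United States": 40,
    "Israel’s": 31,
    "Israel": 1593,
    "Washington, DC": 4,
    "Washington DC": 3,
}

# Merge the counts of all variants under one lowercase key
merged = {}
for place, count in raw_counts.items():
    key = normalize_place(place).lower()
    merged[key] = merged.get(key, 0) + count

print(merged)
```

Because the sample contains only some of the variants, its totals can be lower than those in the full output, where further spellings are merged into the same key.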

## Storing data in a tsv file


In [7]:
# Define the output file name and its path where the data will be saved
filename = "FASDH25-portfolio2/ner_counts.tsv"

# Open the file in writing mode using UTF-8 encoding
with open(filename, mode="w", encoding="utf-8") as file:
    # Create the header line: column names "Place" and "Count" separated by a tab
    header = "Place\tCount\n"
    file.write(header)

    # Loop through each cleaned place name and its count
    for place, count in cleaned_places.items():
        # Create a row with the place and count separated by a tab
        row = f"{place}\t{count}\n"
        # Write the formatted row to the file
        file.write(row)

The file is now stored inside the cloned repository folder in the Colab session environment. To see it, click the folder icon in the left-hand toolbar in Colab and open the `FASDH25-portfolio2` folder. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/FASDH25-portfolio2/ner_counts.tsv`.

In [8]:
# Open the saved TSV file in read mode using UTF-8 encoding
with open("/content/FASDH25-portfolio2/ner_counts.tsv", encoding="utf-8") as file:
    print(file.read())

Place	Count
morocco	14
israel	1632
gaza	1655
rabat	3
united states	879
united arab emirates	21
bahrain	11
sudan	3
western sahara	4
washington	62
tel aviv	52
algeria	7
marrakesh	1
maghreb	1
ukraine	47
saudi arabia	39
california	3
west bank	164
dena	1
oakland	1
south africa	208
jordan	43
jerusalem	26
east jerusalem	23
egypt	44
qatar	65
kuala lumpur	4
malaysia	8
palestine	125
indonesia	3
jakarta	2
johannesburg	4
london	17
paris	8
vienna	1
berlin	5
amman	6
washington dc	7
united kingdom	152
manchester	1
yemen	189
india	50
hyderabad	1
colombo kollupitiya	1
namibia	10
germany	31
palestinian territories	1
sweden	3
iran	210
kerman	6
lebanon	178
bethlehem	4
nairoukh	1
china	30
italy	10
spain	7
turkey	25
shawawra	1
hague	39
gaza strip	160
khan younis	23
syria	84
mazzeh	2
damascus	17
houthis	3
red sea	250
babelmandeb strait	2
gulf of aden	27
sanaa	15
hodeidah	5
taiz	2
dhamar	1
albayda	2
saada	3
arabian sea	6
bab almandeb strait	12
asia	18
europe	30
kuwait	2
middle east	102
ankara	7
west	24
tehran
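
For later analysis, the TSV can be loaded back into a dictionary rather than printed as text. The sketch below is one possible approach using Python's built-in `csv` module. It writes a few counts copied from the output above to a temporary file (the file name `ner_counts_sample.tsv` and the variable names are assumptions, chosen so the sketch does not overwrite the real file), reads them back, and sorts them by count.

```python
import csv
import os
import tempfile

# A few cleaned counts copied from the output above
sample_counts = {"morocco": 14, "israel": 1632, "gaza": 1655, "red sea": 250}

# Write to a temporary folder so the sketch does not touch the real file
path = os.path.join(tempfile.mkdtemp(), "ner_counts_sample.tsv")
with open(path, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter="\t")
    writer.writerow(["Place", "Count"])  # header row
    writer.writerows(sample_counts.items())  # one row per place

# Read the file back, converting each count from text to an integer
with open(path, encoding="utf-8", newline="") as file:
    reader = csv.DictReader(file, delimiter="\t")
    loaded = {row["Place"]: int(row["Count"]) for row in reader}

# Sort the places from most to least frequent
top = sorted(loaded.items(), key=lambda item: item[1], reverse=True)
print(top[:3])
```

To run the same reading step on the real data, replace `path` with `/content/FASDH25-portfolio2/ner_counts.tsv`.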