# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [7]:
!pip install stanza



## Import library and download language model

After installing it, we import stanza into our notebook.

In [8]:
import stanza
import os
import re



## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [9]:
stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


## Use the corpus in the portfolio repo
Instead of using the corpus texts in the “session_10.1” folder, your script should use
the (much larger) corpus folder in your fork of the portfolio repository.

In [10]:
!git clone https://github.com/jafar756/FASDH25-portfolio2.git


Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4386, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 4386 (delta 10), reused 7 (delta 7), pack-reused 4370 (from 2)
Receiving objects: 100% (4386/4386), 17.82 MiB | 22.59 MiB/s, done.
Resolving deltas: 100% (19/19), done.


## Extract only the place names from the articles written in January 2024
Use a condition to make sure that only place names from articles written in January
2024 are extracted.

In [11]:
# Define the path where the text files (articles) are stored
corpus_path = "/content/FASDH25-portfolio2/articles"
# Start an empty list to store the paths of matching files
jan_files = []
# Walk through the directory and its subdirectories
for root, _, files in os.walk(corpus_path):
    # Loop through all files in the current directory
    for file in files:
        # Check if the file is from January 2024 and ends with '.txt'
        if '2024-01' in file and file.endswith('.txt'):
            jan_files.append(os.path.join(root, file))

# Print the total number of matching files found
print(f"Found {len(jan_files)} January 2024 files.")


Found 326 January 2024 files.
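
The check `'2024-01' in file` matches that substring anywhere in the file name, not only at the start. The sketch below is a stricter variant, assuming the article file names begin with an ISO date such as `2024-01-15` (an assumption about the corpus). The helper name `find_month_files` and the demo file names are made up for illustration.

```python
import os
import re
import tempfile


def find_month_files(corpus_path, year_month):
    """Return sorted paths of .txt files whose names start with `year_month-DD`."""
    # Assumption: file names begin with an ISO date such as 2024-01-15
    pattern = re.compile(rf"^{re.escape(year_month)}-\d{{2}}.*\.txt$")
    matches = []
    for root, _, files in os.walk(corpus_path):
        for name in files:
            if pattern.match(name):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Small demonstration on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2024-01-05_a.txt", "2023-12-31_b.txt", "notes-2024-01.txt"]:
        open(os.path.join(tmp, name), "w", encoding="utf-8").close()
    demo_matches = [os.path.basename(p) for p in find_month_files(tmp, "2024-01")]

print(demo_matches)
```

Only `2024-01-05_a.txt` is kept in the demonstration: `notes-2024-01.txt` contains the substring but does not start with a date.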


## Place names: count the number of times each named entity that refers to a place is mentioned in these texts

In [17]:
# Step 1: Create a dictionary to store place counts
place_counts = {}

# Step 2: Set the corpus folder path (same folder as in the previous cell)
corpus_path = "/content/FASDH25-portfolio2/articles"


# Step 3: Process each January 2024 file collected in jan_files
for filepath in jan_files:
    with open(filepath, 'r', encoding='utf-8') as file:
        text = file.read()
        doc = nlp(text)
        for ent in doc.ents:
            # Keep only entities tagged as places (GPE or LOC)
            if ent.type in ['GPE', 'LOC']:
                place_name = ent.text.strip()
                if place_name in place_counts:
                    place_counts[place_name] += 1
                else:
                    place_counts[place_name] = 1

# Step 4: Print all place names and counts
for place, count in place_counts.items():
    print(f"{place}: {count}")


Morocco: 13
Israel: 1593
Gaza: 1605
Rabat: 3
United States: 40
the United Arab Emirates: 13
UAE: 7
Bahrain: 11
Sudan: 3
US: 706
Western Sahara: 3
Washington: 60
Tel Aviv: 49
Algeria: 7
Marrakesh: 1
the Western Sahara: 1
Morocco’s: 1
Maghreb: 1
Ukraine: 47
Saudi Arabia: 39
California: 3
West Bank: 120
Dena: 1
Israel’s: 31
Oakland: 1
the United States: 97
South Africa: 200
Jordan: 42
Jerusalem: 26
East Jerusalem: 23
Egypt: 43
Qatar: 64
Kuala Lumpur: 4
Malaysia: 8
Palestine: 124
Indonesia’s: 1
Jakarta: 2
Johannesburg: 4
London: 17
Paris: 8
Vienna: 1
Berlin: 5
Amman: 6
Washington DC: 3
UK: 95
Manchester: 1
Yemen: 182
Washington, DC: 4
India: 50
Hyderabad: 1
Colombo’s Kollupitiya: 1
Namibia: 10
Germany: 31
Palestinian Territories: 1
Sweden: 2
Iran: 206
Kerman: 6
Lebanon: 175
Bethlehem: 4
Nairoukh: 1
China: 28
Italy: 10
Spain: 7
Turkey: 25
Shawawra: 1
The Hague: 33
South Africa’s: 8
the Gaza Strip: 123
Khan Younis: 23
Syria: 83
Mazzeh: 2
Damascus: 17
U.S.: 11
Houthis’: 3
the Red Sea: 194
the
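
The counting pattern used in this cell (check whether the key exists, then add one) can also be written with `collections.Counter` from the standard library. The sketch below uses hypothetical `(text, type)` pairs in place of stanza's `doc.ents`, so it runs without loading the model. The function name `count_places` is an illustrative choice.

```python
from collections import Counter


def count_places(entities, place_types=("GPE", "LOC")):
    """Count entity texts whose type is one of `place_types`."""
    counts = Counter()
    for text, ent_type in entities:
        if ent_type in place_types:
            counts[text.strip()] += 1
    return counts


# Hypothetical (text, type) pairs standing in for stanza's doc.ents
sample_entities = [
    ("Gaza", "GPE"),
    ("Israel", "GPE"),
    ("Gaza", "GPE"),
    ("the Red Sea", "LOC"),
    ("United Nations", "ORG"),
]
sample_counts = count_places(sample_entities)

# most_common() lists the places from most to least frequent
for place, count in sample_counts.most_common():
    print(f"{place}: {count}")
```

In the notebook, the same function could presumably be fed with `((ent.text, ent.type) for ent in doc.ents)` for each document.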

## Clean up the named entity names

Check whether the data contains duplicates and merge them, using conditions: e.g., add the count for “Gaza’s” to “Gaza” and remove “Gaza’s” from the dictionary.
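
As a minimal sketch of that merging step in isolation, the function below folds possessive forms into their base name by building a new dictionary. It only handles a trailing `'s` or `’s`, which is a simplifying assumption. The name `merge_possessives` is illustrative, and the sample counts are copied from the output above.

```python
def merge_possessives(counts):
    """Fold counts for possessive forms such as "Gaza’s" into the base name."""
    merged = {}
    for name, count in counts.items():
        base = name
        # Strip a trailing 's or ’s (straight or curly apostrophe)
        for suffix in ("'s", "’s"):
            if base.endswith(suffix):
                base = base[: -len(suffix)]
                break
        # Add the count to the base name, creating the key if needed
        merged[base] = merged.get(base, 0) + count
    return merged


# Counts taken from the output above
raw_counts = {"Israel": 1593, "Israel’s": 31, "Morocco": 13, "Morocco’s": 1}
merged_counts = merge_possessives(raw_counts)
print(merged_counts)
```

Building a new dictionary avoids deleting keys from `counts` while looping over it, which would raise a `RuntimeError` in Python.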

In [18]:
# Initialize Stanza pipeline
# stanza.download('en')
# nlp = stanza.Pipeline('en', processors='tokenize,ner')

def normalize_place_name(place):
    """Normalize place names using standardized naming conventions"""
    place = place.strip()

    # Remove a leading "the" and any possessive 's (straight or curly apostrophe)
    place = re.sub(r'^the\s+', '', place, flags=re.IGNORECASE)
    place = re.sub(r'[\'’]s', '', place)

    # Standard naming conventions dictionary
    standard_names = {
        # Region normalizations
        'gaza': 'Gaza',  # Catches all Gaza variants

        # Country abbreviations
        'US': 'United States',
        'U.S.': 'United States',
        'USA': 'United States',
        'UK': 'United Kingdom',
        'UAE': 'United Arab Emirates',
        'Britain': 'United Kingdom',

        # Official names to common names
        'State of Israel': 'Israel',
        'Islamic Republic of Iran': 'Iran',
        'Republic of Yemen': 'Yemen',
        'State of Palestine': 'Palestine',

        # Common misspellings
        'Beruit': 'Beirut',
        'Dahiyeb': 'Dahiyeh',
        'Tel Israel': 'Tel Aviv',

        # Sub-region normalizations
        'WestBank': 'West Bank',
        'Gaza Strip': 'Gaza',
        'Gaza City': 'Gaza'
    }

    # Check for Gaza first (special case)
    if re.search(r'gaza', place.lower()):
        return standard_names['gaza']

    # Return standardized name if exists, otherwise original
    return standard_names.get(place, place)

# Initialize places dictionary
places = {}

folder = "/content/FASDH25-portfolio2/articles"

for filename in os.listdir(folder):
    if filename.startswith("2024-01-"):
        path = os.path.join(folder, filename)
        with open(path, encoding="utf-8") as file:
            text = file.read()
        doc = nlp(text)

        for sentence in doc.sentences:
            for ent in sentence.ents:
                if ent.type in ["GPE", "LOC"]:
                    normalized = normalize_place_name(ent.text)
                    if normalized in places:
                        places[normalized] += 1
                    else:
                        places[normalized] = 1

print(places)


{'Morocco': 14, 'Israel': 1632, 'Gaza': 1830, 'Rabat': 3, 'United States': 877, 'United Arab Emirates': 21, 'Bahrain': 11, 'Sudan': 3, 'Western Sahara': 4, 'Washington': 62, 'Tel Aviv': 52, 'Algeria': 7, 'Marrakesh': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 164, 'Dena': 1, 'Oakland': 1, 'South Africa': 208, 'Jordan': 43, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 44, 'Qatar': 65, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 125, 'Indonesia': 3, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 3, 'United Kingdom': 152, 'Manchester': 1, 'Yemen': 189, 'Washington, DC': 4, 'India': 50, 'Hyderabad': 1, 'Colombo Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 3, 'Iran': 210, 'Kerman': 6, 'Lebanon': 178, 'Bethlehem': 4, 'Nairoukh': 1, 'China': 30, 'Italy': 10, 'Spain': 7, 'Turkey': 25, 'Shawawra': 1, 'Hague': 39, 'Khan Younis': 23, 'Syria': 84

### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "name" and "frequency".
We'll create the tsv file in two steps:

1. we create the header: that is, the column names, separated by tabs
2. we loop through all the place names, and we create a new row in the table for each place. Each row will contain the place name and its frequency, separated by a tab. Each row will have to start on a new line, so we'll also have to add a newline character \n to the row; should we add it at the beginning or end of the line, or both?

Fill in the blanks:


In [19]:
# Define the name and path of the output file
filename = "/content/FASDH25-portfolio2/ner_counts.tsv"

# Open the file in writing mode using UTF-8 encoding
with open(filename, mode="w", encoding="utf-8") as file:
    # Create the header line: column names separated by a tab
    header = "Place\tCount\n"
    file.write(header)

    # Loop through the places dictionary (cleaned place names and their counts)
    for place, count in places.items():
        # Create a row with the place and count separated by a tab
        row = f"{place}\t{count}\n"
        file.write(row)
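
As an alternative to building each row by hand, the standard library's `csv` module can write tab-separated rows when given `delimiter="\t"`. The sketch below writes to an in-memory buffer so it can be run without touching the file system. The function name `write_counts_tsv` is illustrative, and the two sample rows are copied from the cleaned counts above.

```python
import csv
import io


def write_counts_tsv(counts, handle):
    """Write a header row and one tab-separated row per place to `handle`."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Place", "Count"])
    for place, count in counts.items():
        writer.writerow([place, count])


# Demonstration with an in-memory buffer instead of a file on disk
buffer = io.StringIO()
write_counts_tsv({"Morocco": 14, "Israel": 1632}, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

When passing a real file instead of the buffer, the `csv` documentation recommends opening it with `newline=""`, for example `open(filename, "w", encoding="utf-8", newline="")`.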

The file will now be stored in our Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click it to view it in Colab. Right-click it and choose "Download" to download the file.

To access it in your script, use the path `/content/FASDH25-portfolio2/ner_counts.tsv`

In [23]:
# Open the file from the path where it's stored
with open("/content/FASDH25-portfolio2/ner_counts.tsv", encoding="utf-8") as file:
    print(file.read())

Place	Count
Morocco	14
Israel	1632
Gaza	1830
Rabat	3
United States	877
United Arab Emirates	21
Bahrain	11
Sudan	3
Western Sahara	4
Washington	62
Tel Aviv	52
Algeria	7
Marrakesh	1
Maghreb	1
Ukraine	47
Saudi Arabia	39
California	3
West Bank	164
Dena	1
Oakland	1
South Africa	208
Jordan	43
Jerusalem	26
East Jerusalem	23
Egypt	44
Qatar	65
Kuala Lumpur	4
Malaysia	8
Palestine	125
Indonesia	3
Jakarta	2
Johannesburg	4
London	17
Paris	8
Vienna	1
Berlin	5
Amman	6
Washington DC	3
United Kingdom	152
Manchester	1
Yemen	189
Washington, DC	4
India	50
Hyderabad	1
Colombo Kollupitiya	1
Namibia	10
Germany	31
Palestinian Territories	1
Sweden	3
Iran	210
Kerman	6
Lebanon	178
Bethlehem	4
Nairoukh	1
China	30
Italy	10
Spain	7
Turkey	25
Shawawra	1
Hague	39
Khan Younis	23
Syria	84
Mazzeh	2
Damascus	17
Houthis’	3
Red Sea	249
Bab-el-Mandeb Strait	1
Gulf of Aden	27
Sanaa	15
Hodeidah	5
Taiz	2
Dhamar	1
al-Bayda	1
Saada	3
Arabian Sea	6
Bab al-Mandeb Strait	9
Asia	18
Europe	30
Kuwait	2
Middle East	102
Ankara	7
West	24


Now, reuse the code above to get the coordinates for the place names we stored in the `ner_counts.tsv` file.

Write a new tsv file, `ner_gazetteer.tsv`, which contains three columns: name, latitude, longitude.
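
One possible shape for the writing step is sketched below. It does not perform any geocoding: `lookup` is a hypothetical stand-in for whatever coordinate lookup is used, and it is expected to return a `(latitude, longitude)` pair or `None`. The small dictionary of approximate coordinates exists only to make the example runnable.

```python
import io


def write_gazetteer(place_names, lookup, handle):
    """Write name, latitude, longitude rows; skip places `lookup` cannot resolve."""
    handle.write("name\tlatitude\tlongitude\n")
    for name in place_names:
        coords = lookup(name)
        if coords is None:
            # No coordinates found for this name, so leave it out
            continue
        lat, lon = coords
        handle.write(f"{name}\t{lat}\t{lon}\n")


# Hypothetical stand-in for a real geocoding call; approximate coordinates
known_coordinates = {"London": (51.51, -0.13), "Paris": (48.86, 2.35)}

gazetteer = io.StringIO()
write_gazetteer(["London", "Paris", "Dena"], known_coordinates.get, gazetteer)
gazetteer_text = gazetteer.getvalue()
print(gazetteer_text)
```

Names the lookup cannot resolve (here "Dena") are skipped, so the gazetteer only contains rows with usable coordinates.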