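This notebook continues to use the Flair model hosted on HuggingFace (`Saisam/Inquirer_ner_loc`) to detect locations, and compares two ways of splitting the text before prediction.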

Prerequisites:
- Download the NLTK `punkt` tokenizer models and the `punkt_tab` package by running `nltk.download("punkt")` and `nltk.download("punkt_tab")` in a code cell.

## Imports & setup

In [10]:
import nltk
from nltk.tokenize import sent_tokenize

In [11]:
from flair.models import SequenceTagger
from flair.data import Sentence
from flair.nn import Classifier
import string



In [12]:
import pandas as pd

In [13]:
loc_tagger = SequenceTagger.load("Saisam/Inquirer_ner_loc") # initialize the model

2025-10-31 01:39:33,421 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
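
The tag dictionary in the log above follows the BIOES scheme: `B-`, `I-`, and `E-` mark the beginning, inside, and end of a multi-token entity, and `S-` marks a single-token entity. As a rough illustration of how such tags turn into location spans, here is a minimal sketch. The function `decode_bioes` and the example tokens are hypothetical, and this is a strict toy decoder, not Flair's own decoding logic.

```python
def decode_bioes(tokens, tags, entity_type="LOC"):
    """Collect entity spans of one type from parallel token and BIOES tag lists."""
    spans = []
    current = []  # tokens of the span currently being built
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if label != entity_type:
            current = []  # "O" or another entity type interrupts an open span
            continue
        if prefix == "S":  # single-token entity
            spans.append(token)
            current = []
        elif prefix == "B":  # start of a multi-token entity
            current = [token]
        elif prefix == "I" and current:  # continue an open entity
            current.append(token)
        elif prefix == "E" and current:  # close an open entity
            current.append(token)
            spans.append(" ".join(current))
            current = []
        else:
            current = []  # I- or E- without a preceding B-: treated as malformed
    return spans


# Hypothetical tokens and tags, for illustration only
tokens = ["Shot", "near", "Cedar", "Avenue", "S.", "in", "Minneapolis"]
tags = ["O", "O", "B-LOC", "I-LOC", "E-LOC", "O", "S-LOC"]
spans = decode_bioes(tokens, tags)
print(spans)  # ['Cedar Avenue S.', 'Minneapolis']
```

In this scheme, a span that ends early (an `E-LOC` placed on the wrong token) produces a truncated location, which is one way a multi-word name can come out shortened.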


## Load data

In [3]:
# Read the full dataframe where the data was partially cleaned
df = pd.read_csv("cleaned_police_reports.csv")
df.head()

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
1,"Guled Abdullahi, badge #706",Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne..."
2,"Dean V. Albers, badge #None",Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D..."
3,"Scott Aikins, badge #22",Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14..."
4,"Matthew Aish, badge #None",Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...


In [4]:
# Read the small dataframe of manually cleaned articles
small_df = pd.read_csv("manually_cleaned_police_reports_small.csv")
small_df.head()

Unnamed: 0,name,department,url,text
0,"Jeffrey Pennaz, badge #5551",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Witnesses say the stop happened around 8:30 p....
1,"Kurt Radke, badge #5882",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00241 RCA Legal S...
2,"Craig A. Taylor, badge #7139",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00233 RCA Legal S...
3,"Cory Taylor, badge #7141",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00240 RCA Legal S...
4,"Joseph Will, badge #7749",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00230 RCA Legal S...


In [5]:
unique_texts_full = pd.DataFrame(df["text"].unique(), columns = ["text"]) # get unique articles and save them as a csv
unique_texts_full.to_csv("unique_texts_full.csv", index=False)

In [6]:
# I asked ChatGPT which texts in `unique_texts_full` have valid characters. These are the indices of those texts. 
valid_indices = [32, 46, 47, 72, 78, 83, 90, 96, 102, 109, 110, 112, 120, 122, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257]

valid_texts = unique_texts_full.iloc[valid_indices, :].reset_index(drop=True) # get all the valid texts
valid_texts

Unnamed: 0,text
0,CASE 0:13-cv-00435-DSD-JJK Document 1 File...
1,OAO450 (Rev. 5/85) Judgment in a Civil Case ...
2,CASE 0:10-cv-02092-ADM-TNL Doc. 12 Filed 0...
3,How to Spend Millions Settling Lawsuits City P...
4,Exhibit B CASE 0:08-cv-00600-DWF-AJB Documen...
...,...
142,Minneapolis City of Lakes Police Department ...
143,3/18/2021 Police: Ofﬁcers Involved In Shooting...
144,(https://lp.ﬁndlaw.com/) FINDLAW (HTTPS://LP.F...
145,State of Minnesota District Court County of No...


In [7]:
# Pick a few documents to test the NER model on.
# Try to find documents that are not court cases by filtering out any text containing the word "case".
# NOTE: this does not guarantee that none of these documents are court cases; it is only a quick filter on the dataset.
# We would need some other way to check which ones are cases.
non_cases = valid_texts.loc[~valid_texts["text"].str.lower().str.contains("case"), "text"]

for t in non_cases.values:
    print(f"{t}\n")

that she didn't think she had used force. but that  1 3.43  still wanted her to do a use  of force report.  SUMMARY OF ALLEGATIONS AND RECOMMENDATION  Allegations  Investigative Facts  Recommendation  13.43 - Personnel Data  12  1212297 13.43 - Personnel Data  13  1212298


St. Paul Police Department PERSONNEL PROFILE FOR (236500) - BARRAGAN, PAMELA MARGARITA 503 10/04/2022 1 Page Promotional History Date Rank Title Type 10/14/2017 4 Commander             Advance             01/07/2006 2 Sergeant             Advance             09/11/1999 1 Police Officer             Certified             03/01/1996 1 Comm Liaisn Off             Certified             St. Paul Police Department PERSONNEL PROFILE FOR (236500) - BARRAGAN, PAMELA MARGARITA 503 10/04/2022 2 Page Date Unit Assignment History 03/30/2019 Community Partnerships           10/14/2017 City-Wide Services             10/01/2016 Community Engagement             09/19/2015 Central District             12/14/2013 Sexual Violence       
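
One possible extra signal for spotting court filings, beyond the word "case", is a federal docket number such as the `0:13-cv-00435` that appears in `valid_texts`. The sketch below is only a heuristic under that assumption. `DOCKET_PATTERN` and `looks_like_court_filing` are hypothetical names, and the pattern will miss filings that use other numbering (for example "Civil No. 01-1488").

```python
import re

# Assumption: filings carry a docket number shaped like "0:13-cv-00435"
# (office digit(s), two-digit year, "cv" or "cr", sequence number).
DOCKET_PATTERN = re.compile(r"\b\d+:\d{2}-c[vr]-\d+", re.IGNORECASE)


def looks_like_court_filing(text: str) -> bool:
    """Return True if the text contains something shaped like a docket number."""
    return DOCKET_PATTERN.search(text) is not None


# Openings of two documents shown earlier in this notebook
samples = [
    "CASE 0:13-cv-00435-DSD-JJK Document 1",
    "St. Paul Police Department PERSONNEL PROFILE",
]
flags = [looks_like_court_filing(sample) for sample in samples]
print(flags)  # [True, False]
```

On the DataFrame, the same check could be applied with `valid_texts["text"].apply(looks_like_court_filing)` and combined with the existing `"case"` filter.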

In [8]:
# pick the last 5 texts that showed up above
valid_texts_small = non_cases.iloc[-5:].tolist()
valid_texts_small

['3/18/2021 Minneapolis ofﬁcer wounds man suspected in shooting - StarTribune.com https://www.startribune.com/minneapolis-ofﬁcer-wounds-man-suspected-in-shooting/127297968/ 1/2 MINNEAPOLIS Minneapolis officer wounds man suspected in shooting Man shot someone and minutes later is shot himself by Minneapolis police.  By Matt McKinney (https://www.startribune.com/matt-mckinney/6370539/) and PAUL Walsh AUGUST 9, 2011 — 9:59PM A man suspected of shooting someone at a south Minneapolis public housing complex Monday night was confronted and shot minutes later by Minneapolis police, authorities said. As he recovered from a police gunshot at Hennepin County Medical Center, Leroy Martinez, 23, of 1426 Penn Av. N., was charged Tuesday with second-degree assault in connection with a shooting near the playground of the Little Earth of United Tribes public housing complex. Like Martinez, the victim in that shooting is expected to live. The shootings took place around 11:15 p.m., when two off-duty of

In [9]:
pd.DataFrame(valid_texts_small, columns=["text"]).to_csv("valid_texts_small.csv", index=False)

## Helper functions

In [15]:
def split_text(text: str, method: str, n: int = -1) -> list[Sentence]:
    """ 
    Split a string of text into chunks with the specified method.

    Parameters:
        - text: the text to split
        - method: How to split the text. To split by sentences, pass `method = "sentences"`. 
                  To split the text every `n` words, pass `method = "every_n_words"`, and pass the number of words for `n`. 
        - n: the number of words in each chunk. Only used (and must be positive) when `method = "every_n_words"`.

    Returns: a list of Flair Sentence objects, in order, each one representing a chunk
    """

    # Split every n words
    if method == "every_n_words":
        if n <= 0:
            # With the default n = -1, range(0, len(words), n) is empty and no chunks would be produced.
            raise ValueError("n must be a positive integer when method is 'every_n_words'")
        words = text.split() # split the text into a list of words by splitting it on every whitespace
        chunks = [' '.join(words[i:i + n]) for i in range(0, len(words), n)] # split text into a list of chunks of n words
        sentences = [Sentence(chunk) for chunk in chunks] # make each chunk into a Flair Sentence object

    # Split by sentences using a sentence tokenizer
    elif method == "sentences":
        sentence_texts = sent_tokenize(text) # use a tokenizer to get a list of sentences, as strings
        sentences = [Sentence(sentence_text) for sentence_text in sentence_texts] # make each sentence string into a Flair Sentence object

    else:
        raise ValueError(f"Unknown splitting method: {method!r}")

    # Link the sentences so that each Sentence object has a pointer to the previous sentence and the next sentence, to preserve context information.
    Sentence.set_context_for_sentences(sentences) 
    return sentences
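
To make the every-`n`-words strategy concrete, here is a small sketch of just the chunking step on plain strings, without Flair. `chunk_words` is a hypothetical helper and the sample sentence is made up for illustration.

```python
def chunk_words(text: str, n: int) -> list[str]:
    """Split text into consecutive chunks of at most n whitespace-separated words."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


# Made-up sentence; the chunk size of 8 is chosen only to keep the example short
sample = "Officers responded to a call near Little Earth of United Tribes on Monday night"
chunks = chunk_words(sample, 8)
for chunk in chunks:
    print(chunk)
# Officers responded to a call near Little Earth
# of United Tribes on Monday night
```

The boundary here falls inside the name "Little Earth of United Tribes". Fixed-size chunks ignore sentence and entity boundaries, so a multi-word location can end up split across two chunks.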

In [16]:
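def post_process(locations: list[str]) -> list[str]:
    """ 
    Post-process the predicted locations by removing punctuation and whitespace.

    Parameters:
        - locations: list of locations predicted by the NER model
    
    Returns: the post-processed locations, in the same order as they were passed in.
    """
    # Assumption: only leading and trailing punctuation/whitespace is removed, so internal characters
    # (e.g. the hyphen in "I-94") are kept. Note that this also drops a final abbreviation period.
    return [location.strip(string.punctuation + string.whitespace) for location in locations]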

In [17]:
def predict_locations(model, text: str, splitting_method: str, n: int = -1) -> pd.DataFrame:
    """ 
    Predict all the locations in a piece of text using NER.

    Parameters:
        - model: the model to use for NER
        - text: the text to run NER on
        - splitting_method: How to split the text before prediction. To split by sentences, pass `splitting_method = "sentences"`. 
                  To split the text every `n` words, pass `splitting_method = "every_n_words"`, and pass the number of words for `n`. 
                  If you don't want to split the text and want the model to run the prediction on the whole text as one chunk, pass `None`. 
        - n: if you're splitting the text by words, this is the number of words in each chunk of text.

    Returns: a DataFrame with one row per predicted location: a `location` column with the predicted text and a `score` column with its confidence score.
    """

    # Get the text or list of texts that will be passed to the model for prediction.
    # If a splitting method is specified, get the split text as a list of Flair Sentence objects. Otherwise, make one Sentence object for the whole text.
    text_chunks = split_text(text, method = splitting_method, n = n) if splitting_method is not None else Sentence(text)

    model.predict(text_chunks)

    locations = [] # all locations predicted by the model
    scores = [] # confidence score for each location, in the same order as the locations.
    
    # If we passed a list of sentences/chunks to the model, loop through each chunk.
    if splitting_method is not None:
        for chunk in text_chunks:
            # print(chunk.text)

            for label in chunk.get_labels():
                # print(f"\tLoc: {label.data_point.text} | Score: {round(label.score, 3)}\n")

                # append location and score to list
                locations.append(label.data_point.text)
                scores.append(label.score)


    # If we didn't split the text and passed only one chunk/sentence to the model, read the labels from that single Sentence.
    else:
        # print(f"{text_chunks.text}\n")
        for label in text_chunks.get_labels():
            # print(f"\tLoc: {label.data_point.text} | Score: {round(label.score, 3)}\n")

            locations.append(label.data_point.text)
            scores.append(label.score)

    return pd.DataFrame.from_dict({"location": locations, "score": scores})
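
Because `predict_locations` returns one row per mention, the same place can show up several times, and some rows have low confidence. The sketch below shows one way to collapse repeats and drop weak predictions using plain Python. `summarize_mentions` and the `min_score` cutoff of 0.7 are hypothetical choices, not something tuned on this data.

```python
def summarize_mentions(mentions, min_score=0.0):
    """Group (location, score) pairs by location.

    Returns {location: (count, best_score)} for mentions scoring at least min_score.
    """
    summary = {}
    for location, score in mentions:
        if score < min_score:
            continue  # skip low-confidence predictions
        count, best = summary.get(location, (0, 0.0))
        summary[location] = (count + 1, max(best, score))
    return summary


# A few rows copied from the every-n-words columns of the Text 3 output further down
mentions = [
    ("4050 Bryant Avenue North.", 0.934791),
    ("Perry", 0.619125),
    ("city", 0.727575),
    ("city", 0.706236),
    ("north side", 0.976661),
]
summary = summarize_mentions(mentions, min_score=0.7)
for location, (count, best) in summary.items():
    print(f"{location}: {count} mention(s), best score {best:.3f}")
```

With the DataFrame itself, a similar summary could be obtained with `results.groupby("location")["score"].agg(["count", "max"])`, where `results` is the output of `predict_locations`.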

In [18]:
def print_text(text):
    """
    Print an article text in a more readable way. 
    """

    sentence_texts = sent_tokenize(text) # split by sentences
    clean_text = "\n".join(sentence_texts) # Make a string with each sentence on its own line
    print(clean_text)

## Compare sentence-based vs. every 50 words splitting

### Try a sample text

In [13]:
text = df.loc[7, "text"] # try a sample text

In [None]:
# Get the locations and confidence scores for splitting every n words
word_results = predict_locations(loc_tagger, text, splitting_method = "every_n_words", n = 50) # split every n words
word_results

In [None]:
sentence_results = predict_locations(loc_tagger, text, splitting_method = "sentences") # split by sentences
sentence_results

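Bug: the model predicts "Minnesota Bureau" where it should predict the full span "Minnesota Bureau of Criminal Apprehension".

On this sample, splitting with the NLTK sentence tokenizer seems to give better results than splitting every 50 words.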

### Try all the texts in `valid_texts_small`

In [19]:
for i, text in enumerate(valid_texts_small):
    # NER results when splitting every 50 words. Rename columns for clarity
    word_results = (
        predict_locations(loc_tagger, text, splitting_method = "every_n_words", n = 50)
        .sort_values(by="location")
        .rename(columns={"location": "location_every_n_words", "score": "score_every_n_words"})
    )
    # NER results when splitting by sentences
    sentence_results = (
        predict_locations(loc_tagger, text, splitting_method = "sentences")
        .sort_values(by="location")
        .rename(columns={"location": "location_by_sentence", "score": "score_by_sentence"})
    )
    # Combine the results for each chunking strategy. NOTE: pd.concat(axis = 1) aligns rows on the original index,
    # not on the sorted order, so the two halves of a row do not necessarily refer to the same location.
    combined_df = pd.concat([word_results, sentence_results], axis = 1)

    print(f"{'-'*50} Text {i}: {'-'*50}")
    display(combined_df)

-------------------------------------------------- Text 0: --------------------------------------------------


Unnamed: 0,location_every_n_words,score_every_n_words,location_by_sentence,score_by_sentence
7,"1426 Penn Av. N.,",0.894913,1426 Penn Av.,0.96145
9,25th Street,0.992111,playground of the Little Earth of United Tribe...,0.843144
10,Cedar Avenue S.,0.996001,housing complex,0.521168
6,"Hennepin County Medical Center,",0.973894,"Hennepin County Medical Center,",0.935269
15,Little Earth,0.834562,"Little Earth of United Tribes,",0.989663
14,"Little Earth of United Tribes,",0.990928,Minneapolis,0.989954
12,"Little Earth of United Tribes,",0.985897,Minneapolis,0.995214
2,MINNEAPOLIS Minneapolis,0.954126,MINNEAPOLIS Minneapolis,0.951128
13,Minneapolis,0.995321,"Little Earth of United Tribes,",0.991427
11,Minneapolis,0.989535,25th Street and Cedar Avenue S.,0.920185


-------------------------------------------------- Text 1: --------------------------------------------------


Unnamed: 0,location_every_n_words,score_every_n_words,location_by_sentence,score_by_sentence
2,Adrian,0.986094,Adrian,0.984677
3,"Adrian Police Department,",0.988747,"Adrian Police Department,",0.981301
0,Minnesota,0.982399,Minnesota,0.982367
4,Minnesota,0.981672,Minnesota,0.987426
5,Nobles County Sheriﬀ's Oﬃce,0.784258,Nobles County,0.870338
6,Sioux Falls,0.984438,Sioux Falls,0.974456
1,"Sioux Falls, SD ",0.955082,"Sioux Falls, SD ",0.954516


-------------------------------------------------- Text 2: --------------------------------------------------


Unnamed: 0,location_every_n_words,score_every_n_words,location_by_sentence,score_by_sentence
8,Adrian,0.978389,Adrian,0.970835
7,Adrian Police,0.795898,Minnesota,0.990701
20,Compass Center,0.973466,Compass Center,0.986615
17,Compass Center,0.980182,Compass Center,0.950269
38,"Gray Media Group, Inc",0.82083,Station -,0.90296
4,Minnesota,0.989413,KSFY),0.632842
6,Minnesota,0.991308,Minnesota-police-officer-arrested-for-assaulti...,0.66712
36,Minnesota,0.979073,Minnesota-police-officer-arrested-for-assaulti...,0.694128
29,Minnesota,0.988719,Minnesota-police-officer-arrested-for-assaulti...,0.631191
27,Minnesota,0.987814,Sioux Falls,0.989319


-------------------------------------------------- Text 3: --------------------------------------------------


Unnamed: 0,location_every_n_words,score_every_n_words,location_by_sentence,score_by_sentence
3,4050 Bryant Avenue North.,0.934791,4050 Bryant Avenue North.,0.913917
0,Minneapolis City,0.922386,Minneapolis City,0.922906
7,Perry,0.619125,hall,0.623486
4,alley behind,0.797496,alley behind,0.812874
5,"ashtray,""",0.868249,"ashtray,""",0.949247
9,city,0.727575,city,0.681478
10,city,0.706236,,
6,"city attorney's office,",0.59625,"city attorney's office,",0.891983
8,city hall,0.627644,city,0.664531
2,north side,0.976661,north side,0.978258


-------------------------------------------------- Text 4: --------------------------------------------------


Unnamed: 0,location_every_n_words,score_every_n_words,location_by_sentence,score_by_sentence
24,.,0.677803,South Minneapolis,0.99317
20,California,0.979523,Minnesota News Minneapolis News St. Paul News,0.874
22,George Floyd Square',0.946647,Gophers High School Sports Rally,0.962127
4,"George Floyd Square,",0.992809,San Francisco Gov,0.843397
19,Gophers High School Sports Rally,0.841245,minnesota.cbslocal.com,0.595568
25,Hennepin County,0.967618,George Floyd Square',0.965542
27,Hennepin County Medical Center.,0.931269,Alaskan,0.795128
6,I-94,0.987459,"George Floyd Square,",0.993472
11,Little Earth residential community,0.966483,MINNEAPOLIS (,0.835805
13,"Little Earth residential community,",0.969297,Little Earth residential community,0.961568
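
Since the side-by-side tables pair rows by index rather than by location, differences between the two strategies are hard to read off directly. A multiset comparison is one alternative. The sketch below uses the location columns of the Text 1 table above, which is the only table short enough to be shown in full. `compare_mentions` is a hypothetical helper.

```python
from collections import Counter


def compare_mentions(a, b):
    """Return (only_in_a, only_in_b, shared) as Counters of location strings."""
    count_a, count_b = Counter(a), Counter(b)
    return count_a - count_b, count_b - count_a, count_a & count_b


# Location columns of the Text 1 table, copied as printed
every_n_words = [
    "Adrian", "Adrian Police Department,", "Minnesota", "Minnesota",
    "Nobles County Sheriﬀ's Oﬃce", "Sioux Falls", "Sioux Falls, SD ",
]
by_sentence = [
    "Adrian", "Adrian Police Department,", "Minnesota", "Minnesota",
    "Nobles County", "Sioux Falls", "Sioux Falls, SD ",
]

only_words, only_sentences, shared = compare_mentions(every_n_words, by_sentence)
print("Only with every-n-words:", dict(only_words))
print("Only with sentences:", dict(only_sentences))
print("Shared mentions:", sum(shared.values()))
```

For Text 1 the two strategies agree on six mentions and differ only in how much of the Nobles County span they capture.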
