### Running the GeoTopic Parser

1. Navigate to the `/Tika_GeoTopic_Parser` directory and start the Lucene gazetteer server:
    ```bash
    lucene-geo-gazetteer -server
    ```
    - If the server is running, you should see this: 
        ```bash
        INFO: Starting ProtocolHandler ["http-nio-8765"]
        ```
2. In a new terminal, navigate to the `/Tika_GeoTopic_Parser` directory and start the GeoTopic server:
    ```bash
    ./geotopic-server
    ```
    - If the server is running, you should see this: 
        ```bash
        INFO  [main] 16:25:04,222 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server ff835cb6-9aa1-4817-ba8d-d035eb174c87 at http://localhost:9998/
        ```

3. In a new terminal, navigate to the `/Tika_GeoTopic_Parser` directory and run this command to test both servers:
    ```bash
    java -cp \
        "tika-build/tika-app-2.7.0.jar:tika-build/tika-parser-nlp-package-2.7.0.jar:./location-ner-model:./geotopic-mime" \
        -Dorg.apache.tika.mime.custom-mimetypes=geotopic-mime/org/apache/tika/mime/custom-mimetypes.xml \
        org.apache.tika.cli.TikaCLI -m polar.geot
    ```
    - If the command runs correctly, the output should be:
        ```bash
        United States, 39.76, -98.5    
        ```
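
Before running the notebook cells, it can help to confirm that both servers are accepting connections. The following is a minimal sketch, not part of the original setup: it assumes the ports shown in the log lines above (8765 for the gazetteer, 9998 for the Tika server), and the helper names `port_is_open` and `check_servers` are hypothetical.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Ports assumed from the startup logs above; adjust if your servers differ.
SERVERS = {"lucene-geo-gazetteer": 8765, "geotopic (Tika) server": 9998}

def check_servers(host: str = "localhost") -> dict:
    """Map each server name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVERS.items()}

if __name__ == "__main__":
    for name, up in check_servers().items():
        print(f"{name}: {'up' if up else 'NOT reachable'}")
```

An open port only shows that something is listening; the `TikaCLI` test in step 3 remains the real end-to-end check.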



In [12]:
import os
from tika import parser
import re
import pandas as pd
import requests

In [13]:
directory_path = '../../Tika_GeoTopic_Parser/Haunted_Places_Text_Files'
tsv_path = '../../data/haunted_places_entities.tsv'
output_path = '../../data/haunted_places_location.tsv'
gazetteer_url = "http://localhost:8765/api/search"

In [14]:
df_master_dataset = pd.read_csv(tsv_path, sep='\t', engine='python')
df_master_dataset

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude,Entity Labels,Entity Texts
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.495480,42.960727,"['ORG', 'QUANTITY', 'FAC', 'ORG', 'GPE', 'TIME...","['Ada witch -', '3-mile', 'the Ada Cemetery', ..."
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434,"['DATE', 'DATE']","['in.1 month later', 'this day']"
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547,"['FAC', 'GPE', 'CARDINAL', 'CARDINAL', 'TIME',...","['Gorman Rd', 'Sand Creek', 'one', 'one', 'Lat..."
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547,"['DATE', 'CARDINAL', 'CARDINAL', 'DATE', 'CARD...","['1970', 'one', '211', 'today', 'one', 'two', ..."
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.753030,42.243097,"['ORG', 'CARDINAL']",['Kappa Delta Sorority - The Kappa Delta Soror...
...,...,...,...,...,...,...,...,...,...,...,...,...
10987,Westminster,United States,at 12 midnight you can see a lady with two lit...,city hall,Colorado,CO,-105.048936,39.862610,-105.037205,39.836653,"['TIME', 'CARDINAL', 'PERSON']","['12 midnight', 'two', 'Sheridan St.']"
10988,Westminster,United States,Is haunted by the victims of a murder that hap...,Pillar of Fire,Colorado,CO,-105.032091,39.847237,-105.037205,39.836653,['DATE'],['years ago']
10989,Wheat Ridge,United States,The institution was for kids 18 years old and ...,Ridge Mental Institution,Colorado,CO,-105.063974,39.769726,-105.077206,39.766098,"['DATE', 'DATE', 'CARDINAL', 'CARDINAL']","['18 years old', '70', 'one', 'hundreds']"
10990,Wheat Ridge,United States,Gymnasium - their have been reports of a litt...,Wheat Ridge Middle School,Colorado,CO,-105.103613,39.764055,-105.077206,39.766098,,


In [15]:
cnt = 0
directory = sorted(os.listdir(directory_path))

for file in directory:
    if not file.endswith(".txt"):
        continue

    cnt += 1
    try:
        index = int(file.rsplit('_', 1)[-1].split('.')[0])
    except ValueError:
        print(f"[!] Skipping invalid file: {file}")
        continue

    file_path = os.path.join(directory_path, file)

    # Step 1: Extract location
    location = None
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            match = re.search(r"^Location:\s*(.+)$", line, re.IGNORECASE)
            if match:
                location = match.group(1).strip()
                break

    print(f"[DEBUG] {file} → Extracted location: {location}")

    name = lat = lon = "NaN"

    # Step 2: Query Gazetteer
    if location:
        try:
            # A timeout keeps one stalled request from hanging the whole loop.
            response = requests.get(gazetteer_url, params={"s": location}, timeout=10)
            print(f"[DEBUG] Query: {response.url}")
            print(f"[DEBUG] Status: {response.status_code}")
            print(f"[DEBUG] Response: {response.text}")

            if response.status_code == 200:
                results = response.json()
                matched_key = next((k for k in results if k.lower() == location.lower()), None)

                if matched_key and results[matched_key]:
                    top_place = results[matched_key][0]
                    name = top_place.get("name", matched_key)
                    lat = top_place.get("latitude", "NaN")
                    lon = top_place.get("longitude", "NaN")
                else:
                    # Fallback: Try "Location County"
                    location_alt = f"{location} County"
                    response = requests.get(gazetteer_url, params={"s": location_alt}, timeout=10)
                    results = response.json()
                    matched_key = next((k for k in results if k.lower() == location_alt.lower()), None)
                    if matched_key and results[matched_key]:
                        top_place = results[matched_key][0]
                        name = top_place.get("name", matched_key)
                        lat = top_place.get("latitude", "NaN")
                        lon = top_place.get("longitude", "NaN")

        except Exception as e:
            print(f"[X] {file} → Failed to resolve {location}: {e}")

    # Step 3: Save to DataFrame
    df_master_dataset.loc[index, 'GeoTopic Name'] = name
    df_master_dataset.loc[index, 'GeoTopic Latitude'] = lat
    df_master_dataset.loc[index, 'GeoTopic Longitude'] = lon

    print(f"Processed file {cnt}: {file} - ({name}, {lat}, {lon})")

[DEBUG] Haunted_Places_0.txt → Extracted location: Ada Cemetery
[DEBUG] Query: http://localhost:8765/api/search?s=Ada+Cemetery
[DEBUG] Status: 200
[DEBUG] Response: {"Ada Cemetery":[{"name":"Ada Cemetery","countryCode":"US","admin1Code":"MI","admin2Code":"081","latitude":42.96252,"longitude":-85.50474}]}

Processed file 1: Haunted_Places_0.txt - (Ada Cemetery, 42.96252, -85.50474)
[DEBUG] Haunted_Places_1.txt → Extracted location: North Adams Rd.
[DEBUG] Query: http://localhost:8765/api/search?s=North+Adams+Rd.
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 2: Haunted_Places_1.txt - (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10.txt → Extracted location: The Yellow Motel
[DEBUG] Query: http://localhost:8765/api/search?s=The+Yellow+Motel
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 3: Haunted_Places_10.txt - (NaN, NaN, NaN)
[DEBUG] Haunted_Places_100.txt → Extracted location: O.W. Best Middle School
[DEBUG] Query: http://localhost:8765/api/search?s=O.W.+Best+Middle+Scho

  df_master_dataset.loc[index, 'GeoTopic Latitude'] = lat
  df_master_dataset.loc[index, 'GeoTopic Longitude'] = lon


[DEBUG] Haunted_Places_10012.txt → Extracted location: The St. Joseph School District offices
[DEBUG] Query: http://localhost:8765/api/search?s=The+St.+Joseph+School+District+offices
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 19: Haunted_Places_10012.txt - (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10013.txt → Extracted location: The Social Parlor
[DEBUG] Query: http://localhost:8765/api/search?s=The+Social+Parlor
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 20: Haunted_Places_10013.txt - (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10014.txt → Extracted location: Alexian Brothers Hospital Site
[DEBUG] Query: http://localhost:8765/api/search?s=Alexian+Brothers+Hospital+Site
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 21: Haunted_Places_10014.txt - (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10015.txt → Extracted location: The Book House
[DEBUG] Query: http://localhost:8765/api/search?s=The+Book+House
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 22: H
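
The loop above repeats the same match-and-unpack logic for the primary query and the "County" fallback. One possible way to factor it is sketched below. This is an illustration, not the notebook's actual code: the helper names (`parse_index`, `extract_location`, `pick_match`, `resolve`) are hypothetical, and the HTTP call is passed in as a `search` callable so the matching logic can be exercised without a running gazetteer.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

LOCATION_RE = re.compile(r"^Location:\s*(.+)$", re.IGNORECASE)

def parse_index(filename: str) -> Optional[int]:
    """Return the trailing integer of names like 'Haunted_Places_42.txt', else None."""
    stem = filename.rsplit("_", 1)[-1].split(".")[0]
    try:
        return int(stem)
    except ValueError:
        return None

def extract_location(lines: Iterable[str]) -> Optional[str]:
    """Return the value of the first 'Location: ...' line, or None if absent."""
    for line in lines:
        match = LOCATION_RE.search(line)
        if match:
            return match.group(1).strip()
    return None

def pick_match(results: dict, query: str) -> Optional[Tuple]:
    """Return (name, lat, lon) for the key equal to query, ignoring case."""
    key = next((k for k in results if k.lower() == query.lower()), None)
    if key is None or not results[key]:
        return None
    top = results[key][0]
    return top.get("name", key), top.get("latitude"), top.get("longitude")

def resolve(location: str, search: Callable[[str], dict]) -> Optional[Tuple]:
    """Try the location itself, then '<location> County'; first hit wins."""
    for query in (location, f"{location} County"):
        hit = pick_match(search(query), query)
        if hit:
            return hit
    return None
```

In the notebook, `search` could be something like `lambda q: requests.get(gazetteer_url, params={"s": q}, timeout=10).json()`, with the existing `try`/`except` wrapped around the call to `resolve`.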

In [16]:
# Step 4: Save output to TSV
df_master_dataset.to_csv(output_path, sep="\t", index=False, encoding="utf-8")
print(f"\n Saved updated dataset to: {output_path}")



 Saved updated dataset to: ../../data/haunted_places_location.tsv


In [10]:
df_new = pd.read_csv(output_path, sep='\t', engine='python')
df_new

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude,Entity Labels,Entity Texts,GeoTopic Name,GeoTopic Latitude,GeoTopic Longitude
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.495480,42.960727,"['ORG', 'QUANTITY', 'FAC', 'ORG', 'GPE', 'TIME...","['Ada witch -', '3-mile', 'the Ada Cemetery', ...",Ada Cemetery,42.96252,-85.50474
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434,"['DATE', 'DATE']","['in.1 month later', 'this day']",,,
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547,"['FAC', 'GPE', 'CARDINAL', 'CARDINAL', 'TIME',...","['Gorman Rd', 'Sand Creek', 'one', 'one', 'Lat...",,,
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547,"['DATE', 'CARDINAL', 'CARDINAL', 'DATE', 'CARD...","['1970', 'one', '211', 'today', 'one', 'two', ...",Siena Heights University,41.90616,-84.01467
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.753030,42.243097,"['ORG', 'CARDINAL']",['Kappa Delta Sorority - The Kappa Delta Soror...,Albion College Historical Marker,42.24662,-84.74420
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10987,Westminster,United States,at 12 midnight you can see a lady with two lit...,city hall,Colorado,CO,-105.048936,39.862610,-105.037205,39.836653,"['TIME', 'CARDINAL', 'PERSON']","['12 midnight', 'two', 'Sheridan St.']",City Hall,51.50492,-0.07867
10988,Westminster,United States,Is haunted by the victims of a murder that hap...,Pillar of Fire,Colorado,CO,-105.032091,39.847237,-105.037205,39.836653,['DATE'],['years ago'],Pillar of Fire Church,41.41787,-75.06767
10989,Wheat Ridge,United States,The institution was for kids 18 years old and ...,Ridge Mental Institution,Colorado,CO,-105.063974,39.769726,-105.077206,39.766098,"['DATE', 'DATE', 'CARDINAL', 'CARDINAL']","['18 years old', '70', 'one', 'hundreds']",,,
10990,Wheat Ridge,United States,Gymnasium - their have been reports of a litt...,Wheat Ridge Middle School,Colorado,CO,-105.103613,39.764055,-105.077206,39.766098,,,,,


In [17]:
# Total entries
total_entries = len(df_new)

# Count valid (non-NaN) lat/lon rows
known_count = (~df_new["GeoTopic Latitude"].isna() & ~df_new["GeoTopic Longitude"].isna()).sum()
unknown_count = total_entries - known_count

# Calculate percentages
known_percentage = (known_count / total_entries) * 100
unknown_percentage = (unknown_count / total_entries) * 100

# Print results
print(f"Entries with valid coordinates: {known_count} ({known_percentage:.2f}%)")
print(f"Entries with NaN coordinates: {unknown_count} ({unknown_percentage:.2f}%)")


Entries with valid coordinates: 3912 (35.59%)
Entries with NaN coordinates: 7080 (64.41%)
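
The count above treats any non-NaN pair as valid, but a gazetteer hit is not necessarily the right place: in the table above, the Westminster, Colorado "city hall" row resolved to coordinates near 51.5, -0.08, which is far from the row's own `latitude` and `longitude`. A rough sanity check is to compare the two coordinate pairs by great-circle distance. The sketch below uses the standard haversine formula; the function names and the 50 km threshold are assumptions chosen for illustration, not values from the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_suspicious(row_lat, row_lon, geo_lat, geo_lon, max_km: float = 50.0) -> bool:
    """Flag a gazetteer result that lies more than max_km from the dataset's coordinates."""
    return haversine_km(row_lat, row_lon, geo_lat, geo_lon) > max_km

# Two rows from the table above: (dataset lat, lon) vs (GeoTopic lat, lon).
print(is_suspicious(42.962106, -85.504893, 42.96252, -85.50474))   # Ada Cemetery
print(is_suspicious(39.862610, -105.048936, 51.50492, -0.07867))   # "city hall"
```

Applied row by row to `df_new`, a check like this would separate entries that merely have coordinates from entries whose coordinates are plausibly correct.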
