### Running the GeoTopic Parser

1. Navigate to the `/Tika_GeoTopic_Parser` directory and start the Lucene gazetteer server:
    ```bash
    lucene-geo-gazetteer -server
    ```
    - If the server is running, you should see this: 
        ```bash
        INFO: Starting ProtocolHandler ["http-nio-8765"]
        ```
2. In a new terminal, navigate to `/Tika_GeoTopic_Parser` and start the GeoTopic (Tika) server:
    ```bash
    ./geotopic-server
    ```
    - If the server is running, you should see this: 
        ```bash
        INFO  [main] 16:25:04,222 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server ff835cb6-9aa1-4817-ba8d-d035eb174c87 at http://localhost:9998/
        ```

3. In a third terminal, navigate to `/Tika_GeoTopic_Parser` and test both servers by running the parser on the sample file `polar.geot`:
    ```bash
    java -cp \
        "tika-build/tika-app-2.7.0.jar:tika-build/tika-parser-nlp-package-2.7.0.jar:./location-ner-model:./geotopic-mime" \
        -Dorg.apache.tika.mime.custom-mimetypes=geotopic-mime/org/apache/tika/mime/custom-mimetypes.xml \
        org.apache.tika.cli.TikaCLI -m polar.geot
    ```
    - If the command runs correctly, the output should be:
        ```bash
        United States, 39.76, -98.5    
        ```
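
Before running the cells below, it can help to confirm from Python that both servers accept connections. The following is a minimal sketch, not part of the original workflow: `port_is_open` is a hypothetical helper, and the ports 8765 (gazetteer) and 9998 (Tika) are taken from the log lines shown in steps 1 and 2.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the startup log lines in steps 1 and 2 (assumed unchanged).
    for label, port in [("lucene-geo-gazetteer", 8765), ("Tika/GeoTopic server", 9998)]:
        status = "up" if port_is_open("localhost", port) else "not reachable"
        print(f"{label} on port {port}: {status}")
```

An open port only shows that something is listening; the `polar.geot` test in step 3 remains the actual end-to-end check.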



In [31]:
import os
from tika import parser
import re
import pandas as pd
import requests

In [34]:
directory_path = '../../Tika_GeoTopic_Parser/Haunted_Places_Text_Files'
tsv_path = '../../data/haunted_places_combine.tsv'
# output_path = '../../data/haunted_places_combine.tsv'
gazetteer_url = "http://localhost:8765/api/search"

In [35]:
df_master_dataset = pd.read_csv(tsv_path, sep='\t', engine='python')
df_master_dataset

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude,Entity Labels,Entity Texts
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.495480,42.960727,"['ORG', 'QUANTITY', 'FAC', 'ORG', 'GPE', 'TIME...","['Ada witch -', '3-mile', 'the Ada Cemetery', ..."
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434,"['DATE', 'DATE']","['in.1 month later', 'this day']"
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547,"['FAC', 'GPE', 'CARDINAL', 'CARDINAL', 'TIME',...","['Gorman Rd', 'Sand Creek', 'one', 'one', 'Lat..."
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547,"['DATE', 'CARDINAL', 'CARDINAL', 'DATE', 'CARD...","['1970', 'one', '211', 'today', 'one', 'two', ..."
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.753030,42.243097,"['ORG', 'CARDINAL']",['Kappa Delta Sorority - The Kappa Delta Soror...
...,...,...,...,...,...,...,...,...,...,...,...,...
10987,Westminster,United States,at 12 midnight you can see a lady with two lit...,city hall,Colorado,CO,-105.048936,39.862610,-105.037205,39.836653,"['TIME', 'CARDINAL', 'PERSON']","['12 midnight', 'two', 'Sheridan St.']"
10988,Westminster,United States,Is haunted by the victims of a murder that hap...,Pillar of Fire,Colorado,CO,-105.032091,39.847237,-105.037205,39.836653,['DATE'],['years ago']
10989,Wheat Ridge,United States,The institution was for kids 18 years old and ...,Ridge Mental Institution,Colorado,CO,-105.063974,39.769726,-105.077206,39.766098,"['DATE', 'DATE', 'CARDINAL', 'CARDINAL']","['18 years old', '70', 'one', 'hundreds']"
10990,Wheat Ridge,United States,Gymnasium - their have been reports of a litt...,Wheat Ridge Middle School,Colorado,CO,-105.103613,39.764055,-105.077206,39.766098,,


In [36]:
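cnt = 0
# Keep only .txt files so the progress total matches the files actually processed.
# Note: sorted() orders names lexicographically (0, 1, 10, 100, ...), not numerically.
directory = sorted(f for f in os.listdir(directory_path) if f.endswith(".txt"))

for file in directory:
    cnt += 1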
    try:
        index = int(file.rsplit('_', 1)[-1].split('.')[0])
    except ValueError:
        print(f"[!] Skipping invalid file: {file}")
        continue

    file_path = os.path.join(directory_path, file)
    name = lat = lon = "NaN"  # Default values

    # --- Step 1: Try extracting metadata with Tika ---
    try:
        parsed = parser.from_file(file_path, headers={"Content-Type": "application/geotopic"})
        metadata = parsed.get("metadata", {})
        name = metadata.get("Geographic_NAME", "NaN")
        lat = metadata.get("Geographic_LATITUDE", "NaN")
        lon = metadata.get("Geographic_LONGITUDE", "NaN")
        print(f"[✓] Tika result → {file}: ({name}, {lat}, {lon})")
    except Exception as e:
        print(f"[X] Tika failed on {file}: {e}")

    # --- Step 2: Fallback to Gazetteer API if Tika fails ---
    if name == "NaN" or lat == "NaN" or lon == "NaN":
        location = None
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.lower().startswith("location:"):
                    location = line.split(":", 1)[1].strip()
                    break

        print(f"[DEBUG] {file} → Fallback location: {location}")

        if location:
            try:
                # timeout keeps the loop from hanging if the gazetteer server stalls
                response = requests.get(gazetteer_url, params={"s": location}, timeout=10)
                print(f"[DEBUG] Gazetteer Query: {response.url}")
                print(f"[DEBUG] Status: {response.status_code}")
                print(f"[DEBUG] Response: {response.text}")

                if response.status_code == 200:
                    results = response.json()
                    matched_key = next((k for k in results if k.lower() == location.lower()), None)

                    if matched_key and results[matched_key]:
                        top_place = results[matched_key][0]
                        name = top_place.get("name", matched_key)
                        lat = top_place.get("latitude", "NaN")
                        lon = top_place.get("longitude", "NaN")
                    else:
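                        # No exact match: retry once with " County" appended to the location
                        location_alt = f"{location} County"
                        response = requests.get(gazetteer_url, params={"s": location_alt}, timeout=10)
                        response.raise_for_status()  # a non-200 reply is reported by the except below
                        results = response.json()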
                        matched_key = next((k for k in results if k.lower() == location_alt.lower()), None)
                        if matched_key and results[matched_key]:
                            top_place = results[matched_key][0]
                            name = top_place.get("name", matched_key)
                            lat = top_place.get("latitude", "NaN")
                            lon = top_place.get("longitude", "NaN")
            except Exception as e:
                print(f"[X] Gazetteer fallback failed for {file}: {e}")

    # --- Step 3: Update the DataFrame ---
    df_master_dataset.loc[index, 'GeoTopic Name'] = name
    df_master_dataset.loc[index, 'GeoTopic Latitude'] = lat
    df_master_dataset.loc[index, 'GeoTopic Longitude'] = lon

    print(f"Processed file {cnt}/{len(directory)}: {file} → ({name}, {lat}, {lon})")


[✓] Tika result → Haunted_Places_0.txt: (NaN, NaN, NaN)
[DEBUG] Haunted_Places_0.txt → Fallback location: Ada Cemetery
[DEBUG] Gazetteer Query: http://localhost:8765/api/search?s=Ada+Cemetery
[DEBUG] Status: 200
[DEBUG] Response: {"Ada Cemetery":[{"name":"Ada Cemetery","countryCode":"US","admin1Code":"MI","admin2Code":"081","latitude":42.96252,"longitude":-85.50474}]}

Processed file 1/10992: Haunted_Places_0.txt → (Ada Cemetery, 42.96252, -85.50474)
[✓] Tika result → Haunted_Places_1.txt: (NaN, NaN, NaN)
[DEBUG] Haunted_Places_1.txt → Fallback location: North Adams Rd.
[DEBUG] Gazetteer Query: http://localhost:8765/api/search?s=North+Adams+Rd.
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 2/10992: Haunted_Places_1.txt → (NaN, NaN, NaN)
[✓] Tika result → Haunted_Places_10.txt: (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10.txt → Fallback location: The Yellow Motel
[DEBUG] Gazetteer Query: http://localhost:8765/api/search?s=The+Yellow+Motel
[DEBUG] Status: 200
[DEBUG] Response: {}

[DEBUG] Gazetteer Query: http://localhost:8765/api/search?s=%22Old+Main%22+Administration+Building-+Duquesne+University
[DEBUG] Status: 200
[DEBUG] Response: {"\"Old Main\" Administration Building- Duquesne University":[{"name":"Mecklenburg-Western Pomerania","countryCode":"DE","admin1Code":"12","admin2Code":"","latitude":53.83333,"longitude":12.5}]}

Processed file 5/10992: Haunted_Places_1000.txt → (Mecklenburg-Western Pomerania, 53.83333, 12.5)
[✓] Tika result → Haunted_Places_10000.txt: (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10000.txt → Fallback location: St. Charles Nursing Home
[DEBUG] Gazetteer Query: http://localhost:8765/api/search?s=St.+Charles+Nursing+Home
[DEBUG] Status: 200
[DEBUG] Response: {}

Processed file 6/10992: Haunted_Places_10000.txt → (NaN, NaN, NaN)
[✓] Tika result → Haunted_Places_10001.txt: (NaN, NaN, NaN)
[DEBUG] Haunted_Places_10001.txt → Fallback location: Lindenwood University
[DEBUG] Gazetteer Query: http://localhost:8765/api/search?s=Lindenwood+Universi
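
The log above shows that taking the first gazetteer hit can return a place outside the United States (the "Old Main" query resolves to Mecklenburg-Western Pomerania, `countryCode` "DE"). One possible guard, sketched here under the assumption that every record in this dataset is in the US, is to accept only candidates whose `countryCode` is "US". `pick_us_match` is a hypothetical helper, and the response shape is inferred from the logged JSON rather than from gazetteer documentation.

```python
from typing import Optional


def pick_us_match(results: dict, query: str) -> Optional[dict]:
    """Return the first candidate for `query` whose countryCode is "US", else None.

    `results` is the decoded JSON body, shaped like the logged responses:
    {"<query>": [{"name": ..., "countryCode": ..., "latitude": ..., "longitude": ...}]}
    """
    # Case-insensitive key lookup, mirroring the matching done in the loop.
    key = next((k for k in results if k.lower() == query.lower()), None)
    if key is None:
        return None
    for place in results[key]:
        if place.get("countryCode") == "US":
            return place
    return None


# Responses copied from the log output above.
good = {"Ada Cemetery": [{"name": "Ada Cemetery", "countryCode": "US", "admin1Code": "MI",
                          "admin2Code": "081", "latitude": 42.96252, "longitude": -85.50474}]}
bad = {'"Old Main" Administration Building- Duquesne University': [
    {"name": "Mecklenburg-Western Pomerania", "countryCode": "DE", "admin1Code": "12",
     "admin2Code": "", "latitude": 53.83333, "longitude": 12.5}]}

print(pick_us_match(good, "ada cemetery"))
print(pick_us_match(bad, '"Old Main" Administration Building- Duquesne University'))
```

A country filter rejects foreign matches but would still accept a US place in the wrong state; comparing `admin1Code` against the dataset's `state_abbrev` column would be a stricter variant.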

In [40]:
# Step 4: Save the updated dataset (this overwrites the input TSV at tsv_path)
df_master_dataset.to_csv(tsv_path, sep="\t", index=False, encoding="utf-8")
print(f"\n Saved updated dataset to: {tsv_path}")



 Saved updated dataset to: ../../data/haunted_places_combine.tsv


In [41]:
df_new = pd.read_csv(tsv_path, sep='\t', engine='python')
df_new

Unnamed: 0,city,country,description,location,state,state_abbrev,longitude,latitude,city_longitude,city_latitude,Entity Labels,Entity Texts,GeoTopic Name,GeoTopic Latitude,GeoTopic Longitude
0,Ada,United States,Ada witch - Sometimes you can see a misty blue...,Ada Cemetery,Michigan,MI,-85.504893,42.962106,-85.495480,42.960727,"['ORG', 'QUANTITY', 'FAC', 'ORG', 'GPE', 'TIME...","['Ada witch -', '3-mile', 'the Ada Cemetery', ...",Ada Cemetery,42.96252,-85.50474
1,Addison,United States,A little girl was killed suddenly while waitin...,North Adams Rd.,Michigan,MI,-84.381843,41.971425,-84.347168,41.986434,"['DATE', 'DATE']","['in.1 month later', 'this day']",,,
2,Adrian,United States,If you take Gorman Rd. west towards Sand Creek...,Ghost Trestle,Michigan,MI,-84.035656,41.904538,-84.037166,41.897547,"['FAC', 'GPE', 'CARDINAL', 'CARDINAL', 'TIME',...","['Gorman Rd', 'Sand Creek', 'one', 'one', 'Lat...",,,
3,Adrian,United States,"In the 1970's, one room, room 211, in the old ...",Siena Heights University,Michigan,MI,-84.017565,41.905712,-84.037166,41.897547,"['DATE', 'CARDINAL', 'CARDINAL', 'DATE', 'CARD...","['1970', 'one', '211', 'today', 'one', 'two', ...",Sterling Heights,42.58031,-83.03020
4,Albion,United States,Kappa Delta Sorority - The Kappa Delta Sororit...,Albion College,Michigan,MI,-84.745177,42.244006,-84.753030,42.243097,"['ORG', 'CARDINAL']",['Kappa Delta Sorority - The Kappa Delta Soror...,Albion College Historical Marker,42.24662,-84.74420
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10987,Westminster,United States,at 12 midnight you can see a lady with two lit...,city hall,Colorado,CO,-105.048936,39.862610,-105.037205,39.836653,"['TIME', 'CARDINAL', 'PERSON']","['12 midnight', 'two', 'Sheridan St.']",City Hall,51.50492,-0.07867
10988,Westminster,United States,Is haunted by the victims of a murder that hap...,Pillar of Fire,Colorado,CO,-105.032091,39.847237,-105.037205,39.836653,['DATE'],['years ago'],Pillar of Fire Church,41.41787,-75.06767
10989,Wheat Ridge,United States,The institution was for kids 18 years old and ...,Ridge Mental Institution,Colorado,CO,-105.063974,39.769726,-105.077206,39.766098,"['DATE', 'DATE', 'CARDINAL', 'CARDINAL']","['18 years old', '70', 'one', 'hundreds']",,,
10990,Wheat Ridge,United States,Gymnasium - their have been reports of a litt...,Wheat Ridge Middle School,Colorado,CO,-105.103613,39.764055,-105.077206,39.766098,,,,,
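
Some GeoTopic coordinates in the table lie far from the coordinates the dataset already holds for the same row (row 10987, "city hall" in Westminster, Colorado, is placed at 51.50492, -0.07867). A simple way to flag such rows is to compute the great-circle distance between the two coordinate pairs. The sketch below uses the standard haversine formula; the 50 km threshold `MAX_KM` is an arbitrary illustrative value, not something derived from the data.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Row 0 (Ada Cemetery): dataset coordinates vs. GeoTopic coordinates from the table.
near = haversine_km(42.962106, -85.504893, 42.96252, -85.50474)
# Row 10987 (city hall, Westminster): dataset coordinates vs. GeoTopic coordinates.
far = haversine_km(39.862610, -105.048936, 51.50492, -0.07867)

MAX_KM = 50.0  # hypothetical threshold; tune for the dataset
print(f"row 0: {near:.2f} km, suspicious={near > MAX_KM}")
print(f"row 10987: {far:.0f} km, suspicious={far > MAX_KM}")
```

Applied row by row to `latitude`/`longitude` and `GeoTopic Latitude`/`GeoTopic Longitude`, this would separate plausible matches from name collisions elsewhere in the world.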


In [None]:
# Total entries
total_entries = len(df_new)

# Count rows where both GeoTopic coordinates are present (non-NaN)
known_count = (df_new["GeoTopic Latitude"].notna() & df_new["GeoTopic Longitude"].notna()).sum()
unknown_count = total_entries - known_count

# Calculate percentages
known_percentage = (known_count / total_entries) * 100
unknown_percentage = (unknown_count / total_entries) * 100

# Print results
print(f"Entries with valid coordinates: {known_count} ({known_percentage:.2f}%)")
print(f"Entries with NaN coordinates: {unknown_count} ({unknown_percentage:.2f}%)")


Entries with valid coordinates: 4865 (44.26%)
Entries with NaN coordinates: 6127 (55.74%)
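
The overall figure hides how coverage varies across the dataset. As a possible follow-up, the same presence test can be grouped by the `state` column. This is a sketch only: `coverage_by_state` is a hypothetical helper, and the demo frame holds just four rows copied from the table above (rows 0, 1, 10989, 10990), so its percentages say nothing about the full dataset.

```python
import pandas as pd


def coverage_by_state(df: pd.DataFrame) -> pd.Series:
    """Percentage of rows per state that have both GeoTopic coordinates filled in."""
    has_coords = df["GeoTopic Latitude"].notna() & df["GeoTopic Longitude"].notna()
    # Mean of a boolean Series is the fraction of True values in each group.
    return (has_coords.groupby(df["state"]).mean() * 100).sort_values(ascending=False)


# Four rows taken from the table above, standing in for df_new.
demo = pd.DataFrame({
    "state": ["Michigan", "Michigan", "Colorado", "Colorado"],
    "GeoTopic Latitude": [42.96252, None, None, None],
    "GeoTopic Longitude": [-85.50474, None, None, None],
})
print(coverage_by_state(demo))
```

On the real data the call would be `coverage_by_state(df_new)`.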


