# Geoparsing

[![colab badge](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mcallaghan/NLP-climate-science-tutorial-CCAI/blob/main/F_geoparse_texts.ipynb)

The last thing we want to do with our texts is to geoparse them. This involves two steps: extracting place names and resolving them to structured geographic information. The [Mordecai](https://github.com/openeventdata/mordecai) library does both, using neural networks to resolve place names to the correct places based on their context. You will want to install Mordecai in a separate virtual environment; make sure this environment uses the latest version of pip, so that TensorFlow gets installed correctly. Some people have had issues running Mordecai on Macs, so in case it does not work for you, the output of this file is included.

First we will load the data and merge it with the predictions. We only want to run the parser on documents predicted to be relevant.

In [1]:
import pandas as pd
import re
import os

## If we are running in Colab, mount Google Drive and change into the directory we cloned the repository into
if os.path.exists("/content/"):
    from google.colab import drive
    drive.mount('/content/drive')
    os.chdir("/content/drive/MyDrive/NLP-climate-science-tutorial-CCAI")

from D_run_cv_experiments import load_data

df = load_data(False)
df.loc[pd.isna(df["id"]), "id"] = df.loc[pd.isna(df["id"]), "OA_id"]

# Merge data with predictions
df = df.merge(pd.read_csv('cv_data/INCLUDE/predictions_5_splits.csv'), how="outer")
# Where we have no prediction, put the actual label in the prediction column.
df.loc[pd.isna(df["INCLUDE_prediction"]),"INCLUDE_prediction"] = df.loc[pd.isna(df["INCLUDE_prediction"]),"INCLUDE"]
print(df.shape)
df = df[(df["INCLUDE_prediction"]>=0.5)]
print(df.shape)
df.head()


(15636, 16)
(14442, 16)


Unnamed: 0,id,abstract,title,seen,INCLUDE,12 - Coastal and marine Ecosystems,12 - Human and managed,"12 - Mountains, snow and ice","12 - Rivers, lakes, and soil moisture",12 - Terrestrial ES,title_lcase,OA_id,doi,publication_year,authors,INCLUDE_prediction
0,https://openalex.org/W2018832642,The analysis of possible regional climate chan...,An inter-comparison of regional climate models...,0.0,,,,,,,anintercomparisonofregionalclimatemodelsforeur...,https://openalex.org/W2018832642,https://doi.org/10.1007/s10584-006-9213-4,2007.0,"Daniela Jacob, Lars Bärring, Ole Bøssing Chris...",0.515179
2,468699.0,The processes influencing the magnitude of Wes...,Climatic Controls on West Nile Virus and Sindb...,1.0,1.0,0.0,1.0,0.0,0.0,0.0,climaticcontrolsonwestnilevirusandsindbisvirus...,,,,,1.0
7,1284550.0,The long-term history of fire regimes in the M...,Coupled human-climate signals on the fire hist...,1.0,1.0,0.0,0.0,0.0,0.0,1.0,coupledhumanclimatesignalsonthefirehistoryofup...,,,,,1.0
10,695403.0,"Since the late 1940s, snowmelt and runoff have...",LARGE-SCALE ATMOSPHERIC FORCING OF RECENT TREN...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,largescaleatmosphericforcingofrecenttrendstowa...,,,,,1.0
11,1464308.0,Exploring the relationship between hydrologica...,Non-linear relationship of hydrological drough...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,nonlinearrelationshipofhydrologicaldroughtresp...,,,,,1.0
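
The merge-and-filter step above can be illustrated on toy data. This is only a sketch: the three documents and the single prediction below are made up, and only the column names `id`, `INCLUDE` and `INCLUDE_prediction` are taken from the cell above.

```python
import pandas as pd

# Toy stand-ins for the labelled documents and the model predictions
docs = pd.DataFrame({"id": [1, 2, 3], "INCLUDE": [1.0, 0.0, None]})
preds = pd.DataFrame({"id": [3], "INCLUDE_prediction": [0.7]})

# An outer merge keeps documents that have a label but no prediction
merged = docs.merge(preds, how="outer")

# Fall back to the human label wherever there is no prediction
merged["INCLUDE_prediction"] = merged["INCLUDE_prediction"].fillna(merged["INCLUDE"])

# Keep only documents that are labelled or predicted to be relevant
relevant = merged[merged["INCLUDE_prediction"] >= 0.5]
print(relevant["id"].tolist())
```

Document 1 is kept because of its label, document 3 because of its prediction, and document 2 is dropped.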


In [2]:
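# We'll also load any places that have already been processed, or initialise an empty dataframe
if os.path.exists("data/combined_place_df.csv"):
    processed_place_df = pd.read_csv("data/combined_place_df.csv")
    # Skip documents whose id already appears in the doc_id column of the saved places.
    # Ids are compared as strings because the saved file mixes numeric ids and OpenAlex URLs.
    unprocessed_place_df = df[~df['id'].astype(str).isin(processed_place_df["doc_id"].astype(str))]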
else:
    processed_place_df = pd.DataFrame()
    unprocessed_place_df = df
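
As a rough sketch of the resume logic, the following uses toy data and assumes that the saved places carry the document id in a `doc_id` column, as written by the geoparsing loop further down. The ids and place names here are invented.

```python
import pandas as pd

# Toy documents and a toy file of places that were already extracted
docs = pd.DataFrame({"id": ["a", "b", "c"], "title": ["t1", "t2", "t3"]})
already_done = pd.DataFrame({
    "doc_id": ["a", "a", "c"],
    "word": ["Oxford", "Ottawa", "Lima"],
})

# Keep only documents whose id does not appear among the processed doc_ids
todo = docs[~docs["id"].isin(already_done["doc_id"])]
print(todo["id"].tolist())
```

One limitation of this design: a document in which no place was found never appears in the saved file, so it will be geoparsed again on every restart.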

In [3]:
# When we run the geoparser on a string, we get nice structured geographical information
from mordecai import Geoparser
geo = Geoparser()
geo.geoparse("I travelled from Oxford to Ottawa")

2022-08-16 10:54:40.148846: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-08-16 10:54:40.148869: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Models path: /home/max/software/mordecai-env/lib/python3.9/site-packages/mordecai/models/


2022-08-16 10:55:15.653547: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-08-16 10:55:15.653775: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-16 10:55:15.654179: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (max-ThinkPad-X280): /proc/driver/nvidia/version does not exist
2022-08-16 10:55:15.655853: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.




[{'word': 'Oxford',
  'spans': [{'start': 17, 'end': 23}],
  'country_predicted': 'GBR',
  'country_conf': 0.95718795,
  'geo': {'admin1': 'England',
   'lat': '51.75222',
   'lon': '-1.25596',
   'country_code3': 'GBR',
   'geonameid': '2640729',
   'place_name': 'Oxford',
   'feature_class': 'P',
   'feature_code': 'PPLA2'}}]
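
Each result nests the resolved location under a `geo` key, and the loop in the next cell flattens it so that every place becomes one row of a dataframe. A minimal sketch of that flattening, using the values from the output above; the helper name `flatten_place` and the id `"doc-1"` are hypothetical.

```python
# A result shaped like the Mordecai output above (values copied from it)
result = {
    "word": "Oxford",
    "spans": [{"start": 17, "end": 23}],
    "country_predicted": "GBR",
    "country_conf": 0.95718795,
    "geo": {
        "admin1": "England",
        "lat": "51.75222",
        "lon": "-1.25596",
        "country_code3": "GBR",
        "geonameid": "2640729",
        "place_name": "Oxford",
        "feature_class": "P",
        "feature_code": "PPLA2",
    },
}


def flatten_place(place, doc_id):
    """Return a flat copy of one geoparser result, tagged with its document id."""
    flat = {key: value for key, value in place.items() if key != "geo"}
    # The loop below checks for the "geo" key, so default to {} if it is absent
    flat.update(place.get("geo", {}))
    flat["doc_id"] = doc_id
    return flat


row = flatten_place(result, doc_id="doc-1")  # "doc-1" is a made-up id
print(row["place_name"], float(row["lat"]), float(row["lon"]))
```

Note that `lat` and `lon` arrive as strings in the output shown above, so they need converting before any numeric use.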

In [4]:
%%capture 
places = []
geos = []

import re

# Go through the rows of the dataframe
for i, row in unprocessed_place_df.iterrows():
    
    # Get the text we want to geoparse: join title and abstract, then cut off trailing copyright notices
    t = str(row['title']) + " " + str(row['abstract'])
    t = t.split("Copyright (C)")[0]
    t = re.split(r"\([Cc]\) [1-2][0-9]{3} Elsevier", t)[0]
    t = t.split("Published by Elsevier")[0]
    t = t.split("Copyright. (C)")[0]
    t = re.split(r"\. \(C\) [1-2][0-9]{3} ", t)[0]
    t = re.split(r"\. \(C\) Copyright", t)[0]
    t = re.split(r"\. \xA9 [1-2][0-9]{3}", t)[0]  # \xA9 is the copyright symbol
    
    # Remove the names of agreements that contain place names, so they are not geoparsed as locations
    t = re.sub("paris agreement", "", t, flags=re.I)
    t = re.sub("kyoto protocol", "", t, flags=re.I)
    t = re.sub("montreal protocol", "", t, flags=re.I)
    t = re.sub("london protocol", "", t, flags=re.I)
    
    # geoparse
    gp = geo.geoparse(t)
    
    # For each place, append to a list of dictionaries, with a field for the doc_id
    for p in gp:
        if "geo" in p:
            for key, value in p["geo"].items():
                p[key] = value
            del p["geo"]
            
        p["doc_id"] = row["id"]
        places.append(p)

    # Save a checkpoint whenever the index label is a multiple of 1000,
    # so we don't need to start again if we get interrupted
    if i % 1000 == 0:
        combined_place_df = pd.concat([processed_place_df, pd.DataFrame.from_dict(places)])
        print(combined_place_df.shape)
        combined_place_df.to_csv("data/combined_place_df.csv", index=False)

# Merge all the data together
combined_place_df = pd.concat([processed_place_df, pd.DataFrame.from_dict(places)])
print(combined_place_df.shape)
combined_place_df.to_csv("data/combined_place_df.csv", index=False)
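
The copyright-stripping step in the loop can be tried in isolation. The sketch below is not the cell above verbatim: it collects a subset of the patterns into a list, and the helper name `strip_copyright` and the two sample strings are invented for illustration.

```python
import re

# A subset of the copyright patterns used in the loop above
COPYRIGHT_PATTERNS = [
    r"Copyright \(C\)",
    r"\([Cc]\) [12][0-9]{3} Elsevier",
    r"Published by Elsevier",
    r"\. \(C\) [12][0-9]{3} ",
    r"\. \xA9 [12][0-9]{3}",  # \xA9 is the copyright symbol
]


def strip_copyright(text):
    """Cut the text at the first match of each pattern, keeping what precedes it."""
    for pattern in COPYRIGHT_PATTERNS:
        text = re.split(pattern, text)[0]
    return text


sample_a = "Runoff has shifted earlier in the year. (C) 2011 Elsevier B.V. All rights reserved."
sample_b = "Sea ice declined. \xa9 2015 The Authors"
print(strip_copyright(sample_a).strip())
print(strip_copyright(sample_b))
```

Stripping these notices matters because publisher names and addresses would otherwise be candidates for the geoparser.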

## Cleaning geoparsing output

The output from Mordecai has some common errors. Some of the ones we have identified are fixed below
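
Several of the fixes rely on a proximity pattern such as `Paris(?:\S* ){0,15}COP`, which matches "Paris" followed by "COP" within at most 15 space-separated tokens. The following is a minimal sketch of how such a pattern flags documents and how the flagged place mentions can then be removed. The two texts and the toy `places` dataframe are invented, and the pattern is written without capturing groups, which is an assumption made here to avoid the pandas warning about match groups.

```python
import pandas as pd

texts = pd.Series([
    "Commitments made in Paris at COP21 remain insufficient",
    "Urban heat island effects in Paris during summer heatwaves",
])

# "Paris" and "COP" within 15 tokens of each other, in either order
pattern = r"Paris(?:\S* ){0,15}COP|COP(?:\S* ){0,15}Paris"
mentions_conference = texts.str.contains(pattern)
print(mentions_conference.tolist())

# Toy geoparser output: doc_id refers to the position of the text above
places = pd.DataFrame({"doc_id": [0, 0, 1], "word": ["Paris", "France", "Paris"]})
flagged = texts.index[mentions_conference]

# Remove "Paris" only in documents where it refers to the conference
is_false_positive = places["doc_id"].isin(flagged) & (places["word"].str.lower() == "paris")
cleaned = places[~is_false_positive]
print(cleaned)
```

Filtering with a boolean mask, as here, does not depend on index labels being unique. `DataFrame.drop` removes every row that carries a given label, which matters once dataframes with overlapping indexes have been concatenated.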

In [5]:
df['tstring'] = df['title'] + " " + df['abstract']

gm_docs = df.loc[
    (df['tstring'].str.lower().str.contains("gulf of mexico")),
    "id"
]
geocolumns = ["word", "country_conf", "feature_code","lat","lon","place_name","feature_class","geonameid"]
gm = pd.DataFrame({"doc_id": gm_docs})
gm[geocolumns] = ["Gulf of Mexico",0.8,"GULF", 25, -90, "Gulf of Mexico", "H", 3523271]

combined_place_df = pd.concat([combined_place_df, gm])


lab_docs = df.loc[
    (df['tstring'].str.lower().str.contains("labrador sea")),
    "id"
]
geocolumns = ["word", "country_conf", "feature_code","lat","lon","place_name","feature_class","geonameid"]
lab = pd.DataFrame({"doc_id": lab_docs})
lab[geocolumns] = ["Labrador Sea",0.8,"SEA", 57, -55, "Labrador Sea", "H", 3424929]

combined_place_df = pd.concat([combined_place_df, lab])

baf_docs = df.loc[
    (df['tstring'].str.lower().str.contains("baffin bay")),
    "id"
]
geocolumns = ["word", "country_conf", "feature_code","lat","lon","place_name","feature_class","geonameid"]
baf = pd.DataFrame({"doc_id": baf_docs})
baf[geocolumns] = ["Baffin Bay",0.8,"BAY", 74, -68, "Baffin Bay", "H", 3831554]
combined_place_df = pd.concat([combined_place_df, baf])


ok_docs = df.loc[
    (df['tstring'].str.lower().str.contains("sea of okhotsk")) ,
    "id"
]
geocolumns = ["word", "country_conf", "feature_code","lat","lon","place_name","feature_class","geonameid"]
ok = pd.DataFrame({"doc_id": ok_docs})
ok[geocolumns] = ["Sea of Okhotsk",0.8, "SEA", 55, 150, "Sea of Okhotsk", "H", 2127380]
combined_place_df = pd.concat([combined_place_df, ok])

# Drop place names that refer to agreements or conferences rather than to locations


kyoto_docs = df.loc[
    (df['tstring'].str.lower().str.contains("kyoto target")) |
    (df['tstring'].str.lower().str.contains("kyoto process")) |
    (df['tstring'].str.lower().str.contains("kyoto emission")) |
    (df['tstring'].str.lower().str.contains("kyoto gas")) |
    (df['tstring'].str.lower().str.contains("kyoto agreement")) |
    (df['tstring'].str.lower().str.contains("kyoto protocol")) |
    (df['tstring'].str.lower().str.contains("kyoto framework")),
    "id"
]

combined_place_df = combined_place_df.drop(combined_place_df[(combined_place_df['doc_id'].isin(kyoto_docs)) & (combined_place_df['word'].str.lower()=="kyoto")].index)

paris_docs = df.loc[
    (df['tstring'].str.contains('(Paris(?:\S* ){0,15}COP)|(COP(?:\S* ){0,15}Paris)')) |
    (df['tstring'].str.contains('(Paris(?:\S* ){0,15}Agreement)|(COP(?:\S* ){0,15}Agreement)')) ,
    'id'
]
combined_place_df = combined_place_df.drop(combined_place_df[(combined_place_df['doc_id'].isin(paris_docs)) & (combined_place_df['word'].str.lower()=="paris")].index)

# Copenhagen
copenhagen_docs = df.loc[
    (df['tstring'].str.contains('(Copenhagen(?:\S* ){0,15}COP)|(COP(?:\S* ){0,15}Copenhagen)')) |
    (df['tstring'].str.contains('(Copenhagen(?:\S* ){0,3}Accord)|(Accord(?:\S* ){0,3}Copenhagen)')) ,
    'id'
]
combined_place_df = combined_place_df.drop(combined_place_df[(combined_place_df['doc_id'].isin(copenhagen_docs)) & (combined_place_df['word'].str.lower()=="copenhagen")].index)

#Berlin
berlin_docs = df.loc[
    (df['tstring'].str.contains('(Berlin(?:\S* ){0,15}COP)|(COP(?:\S* ){0,15}Berlin)')),
    'id'
]
combined_place_df = combined_place_df.drop(combined_place_df[(combined_place_df['doc_id'].isin(berlin_docs)) & (combined_place_df['word'].str.lower()=="berlin")].index)

#Glasgow
glasgow_docs = df.loc[
    (df['tstring'].str.contains('(Glasgow(?:\S* ){0,15}COP)|(COP(?:\S* ){0,15}Glasgow)')),
    'id'
]
combined_place_df = combined_place_df.drop(combined_place_df[(combined_place_df['doc_id'].isin(glasgow_docs)) & (combined_place_df['word'].str.lower()=="glasgow")].index)

#Cancun
cancun_docs = df.loc[
    (df['tstring'].str.contains('(Cancun(?:\S* ){0,15}COP)|(COP(?:\S* ){0,15}Cancun)')) |
    (df['tstring'].str.lower().str.contains('cancun pledge')),
    'id'
]
combined_place_df = combined_place_df.drop(combined_place_df[(combined_place_df['doc_id'].isin(cancun_docs)) & (combined_place_df['word'].str.lower()=="cancun")].index)


geocolumns = ["feature_code", "lat", "lon", "place_name", "feature_class", "geonameid", "country_code3"]

combined_place_df.loc[combined_place_df["word"]=="Pakistan", geocolumns]=["PCLI",30,70,"Islamic Republic of Pakistan","A",1168579,"PAK"]
combined_place_df.loc[combined_place_df["word"]=="Colombia", geocolumns]=["PCLI",4,-73.25,"Colombia","A",3686110, "COL"]
combined_place_df.loc[combined_place_df["word"]=="Argentina", geocolumns]=["PCLI",-34,-64,"Argentine Republic","A",3865483, "ARG"]
combined_place_df.loc[combined_place_df["word"]=="Sahara", geocolumns] = ["DSRT", 26, 13, "Sahara", "T", 2212709, None]
combined_place_df.loc[combined_place_df["word"]=="Alps",geocolumns] = ["MTS", 46.41667, 10, "Alps", "T", 2661786, None]
# NOTE: the geonameid below is the same one used for the Alps above and should be checked against GeoNames
combined_place_df.loc[combined_place_df["word"]=="Mediterranean Sea",geocolumns] = ["SEA", 35, 20, "Mediterranean Sea", "H", 2661786, None]
combined_place_df.loc[combined_place_df["word"]=="MEDITERRANEAN",geocolumns] = ["SEA", 35, 20, "Mediterranean Sea", "H", 2661786, None]
combined_place_df.loc[combined_place_df["word"]=="East China",geocolumns] = ["PCLI", 35, 105, "China", "A", 1814991, "CHN"]
combined_place_df.loc[combined_place_df["word"]=="South China",geocolumns] = ["PCLI", 35, 105, "China", "A", 1814991, "CHN"]
combined_place_df.loc[combined_place_df["word"]=="Great Lakes",geocolumns] = ["LK", 45.68751, -84.43753, "Great Lakes", "H", 4994594, "USA"]
combined_place_df.loc[combined_place_df["word"]=="Catalonia",geocolumns] = ["ADM1", 41.82046, 1.86768, "Catalunya", "A", 3336901, "ESP"]
combined_place_df.loc[combined_place_df["word"]=="South Pacific",geocolumns] = ["OCN", -45, -130, "South Pacific Ocean", "H", 4030483, None]
combined_place_df.loc[combined_place_df["word"]=="Gulf Coast",geocolumns] = ["AREA", 29.36901, -95.00565, "Gulf Coast", "L", 7287689, "USA"]
combined_place_df.loc[combined_place_df["word"]=="Gulf coast",geocolumns] = ["AREA", 29.36901, -95.00565, "Gulf Coast", "L", 7287689, "USA"]
combined_place_df.loc[combined_place_df["word"]=="Hainan Island",geocolumns] = ["ISL", 19.2, 109.7, "Hainan Dao", "T", 1809055, "CHN"]
combined_place_df.loc[combined_place_df["word"]=="Red Sea",geocolumns] = ["SEA", 20.26735, 38.53455, "Red Sea", "H", 350155, None]
combined_place_df.loc[combined_place_df["word"]=="Himalayan",geocolumns] = ["MTS", 28,84, "Himalayas", "T", 1252558, None]
combined_place_df.loc[combined_place_df["word"]=="Himalayas",geocolumns] = ["MTS", 28,84, "Himalayas", "T", 1252558, None]
combined_place_df.loc[combined_place_df["word"]=="North America's",geocolumns] = ["CONT", 46.07323, -100.54688, "North America", "L", 6255149, None]
combined_place_df.loc[combined_place_df["word"]=="Atlantic Ocean",geocolumns] = ["OCN", 10, -25, "Atlantic Ocean", "H", 3373405, None]
combined_place_df.loc[combined_place_df["word"]=="Scandinavia",geocolumns] = ["RGN", 63, 12, "Scandinavia", "L", 2614165, None]
combined_place_df.loc[combined_place_df["word"]=="California (USA",geocolumns] = ["ADM1", 37.25022, -119.75126, "California", "A", 5332921, "USA"]
combined_place_df.loc[combined_place_df["word"]=="California, USA",geocolumns] = ["ADM1", 37.25022, -119.75126, "California", "A", 5332921, "USA"]
combined_place_df.loc[combined_place_df["word"]=="North Pacific",geocolumns] = ["OCN", 30, -170, "North Pacific Ocean", "H", 4030875, None]
combined_place_df.loc[combined_place_df["word"]=="Huai",geocolumns] = ["STM", 33.133333, 118.5, "Huai He", "H", 1807690, "CHN"]
combined_place_df.loc[combined_place_df["word"]=="Washington, DC",geocolumns] = ["PPLC", 38.89511, -77.03637, "Washington", "P", 4140963, "USA"]
combined_place_df.loc[combined_place_df["word"]=="Messinian",geocolumns] = ["ADM2", 37.25, 21.83333, "Nomos Messinias", "A", 257149, "GRC"]
combined_place_df.loc[combined_place_df["word"]=="Ionian Sea",geocolumns] = ["SEA", 39, 19, "Ionian Sea", "H", 2463713, None]
combined_place_df.loc[combined_place_df["word"]=="NYC",geocolumns] = ["PPL", 40.71427, -74.00597, "New York City", "P", 5128581, "USA"]
combined_place_df.loc[combined_place_df["word"]=="Indian Ocean",geocolumns] = ["OCN", -10, 70, "Indian Ocean", "H", 1545739, None]
combined_place_df.loc[combined_place_df["word"]=="North Sea",geocolumns] = ["SEA", 55, 3, "North Sea", "H", 2960848, None]
combined_place_df.loc[combined_place_df["word"]=="Philippine Sea",geocolumns] = ["SEA", 20, 135, "Philippine Sea", "H", 1818190, None]
combined_place_df.loc[combined_place_df["word"]=="Black Sea",geocolumns] = ["SEA", 43, 34, "Black Sea", "H", 630673, None]
combined_place_df.loc[combined_place_df["word"]=="Coral Sea",geocolumns] = ["SEA", -20, 155, "Coral Sea", "H", 2194166, None]
combined_place_df.loc[combined_place_df["word"]=="Timor Sea",geocolumns] = ["SEA", -11, 127, "Timor Sea", "H", 2078065, None]
combined_place_df.loc[combined_place_df["word"]=="Hudson Bay",geocolumns] = ["BAY", 60, -85, "Hudson Bay", "H", 5978134, "CAN"]
combined_place_df.loc[combined_place_df["word"]=="Bering Sea",geocolumns] = ["SEA", 60, -175, "Bering Sea", "H", 4031788, None]
combined_place_df.loc[combined_place_df["word"]=="Okhotsk Sea",geocolumns] = ["SEA", 55, 150, "Sea of Okhotsk", "H", 2127380, None]

combined_place_df.loc[combined_place_df["place_name"]=="Central Upper Nile",geocolumns] = ["ADM1", 10, 32.7, "Upper Nile", "A", 381229, "SSD"]
combined_place_df.loc[combined_place_df["place_name"]=="Gobolka Woqooyi Galbeed","place_name"] = "Woqooyi Galbeed"

combined_place_df = combined_place_df[combined_place_df["place_name"]!="Pacific County"]

# Words that were tagged as places but are not locations in these texts
# (abbreviations, plant genera, chemical and technical terms, sentence fragments)
not_places = [
    "B.V.", "MMT", "Yellow", "Hadley", "Western North", "colonies",
    "TN", "NH", "Mn", "Tx", "TX", "Tn", "FL",
    "Spartina", "Tamarix", "Eurasia", "Phillyrea", "Quercus",
    "N-15", "LT50", "POSEIDON", "LC50", "El Nio", "La Nia",
    "Red", "Gulf Stream", "NH 1",
    "ZJP", "MSW", "CCS", "Tier-3", "N2O", "VKT", "OECD",
    "States", "North to South", "Stabilising", "Mass Railway", "City",
]
combined_place_df = combined_place_df[~combined_place_df["word"].isin(not_places)]

# Also drop any remaining words of one or two characters
combined_place_df = combined_place_df[combined_place_df["word"].str.len()>2]

combined_place_df.loc[combined_place_df["word"]=="Ireland", geocolumns]=["PCLI",53,-8,"Ireland","A",2963597,"IRL"]
combined_place_df.loc[combined_place_df["word"]=="United States", geocolumns] = ["PCLI",39.76,-98.5,"United States","A",6252001, "USA"]
combined_place_df.loc[combined_place_df["word"]=="Czech Republic", geocolumns] = ["PCLI",49.75,15,"Czechia","A",3077311, "CZE"]
combined_place_df.loc[combined_place_df["word"]=="Czechia", geocolumns] = ["PCLI",49.75,15,"Czechia","A",3077311, "CZE"]
combined_place_df.loc[combined_place_df["word"]=="China", geocolumns] = ["PCLI", 35, 105, "China", "A", 1814991, "CHN"]
combined_place_df.loc[combined_place_df["word"]=="United Arab Emirates", geocolumns] = ["PCLI", 23.75, 54.5, "United Arab Emirates", "A", 290557, "ARE"]


# import pycountry_convert as pc
# def get_cont(x):
#     continents = {
#         'NA': 'North America',
#         'SA': 'South America', 
#         'AS': 'Asia',
#         'OC': 'Oceania',
#         'AF': 'Africa',
#         'EU': 'Europe'
#     }
#     try:
#         return continents[pc.country_alpha2_to_continent_code(pc.country_alpha3_to_country_alpha2(x))]
#     except:
#         return None


# combined_place_df['continent'] = combined_place_df['country_code3'].apply(lambda x: get_cont(x))

combined_place_df.to_csv('data/places.csv', index=False)

print(combined_place_df.shape)

combined_place_df.tail()





(43064, 13)


Unnamed: 0,word,spans,country_predicted,country_conf,admin1,lat,lon,country_code3,geonameid,place_name,feature_class,feature_code,doc_id
13756,Labrador Sea,,,0.8,,57,-55,,3424929,Labrador Sea,H,SEA,https://openalex.org/W2066743080
14331,Labrador Sea,,,0.8,,57,-55,,3424929,Labrador Sea,H,SEA,https://openalex.org/W1606968016
15441,Labrador Sea,,,0.8,,57,-55,,3424929,Labrador Sea,H,SEA,https://openalex.org/W3171583560
11553,Baffin Bay,,,0.8,,74,-68,,3831554,Baffin Bay,H,BAY,https://openalex.org/W2049963706
14331,Baffin Bay,,,0.8,,74,-68,,3831554,Baffin Bay,H,BAY,https://openalex.org/W1606968016
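
The long run of one-line overrides in the cleaning cell could also be driven from a lookup table. The following is only a sketch of that alternative, not the approach used above: the `overrides` dictionary and the toy `places` dataframe are assumptions, while the two sets of values are copied from the cleaning cell.

```python
import pandas as pd

geocolumns = ["feature_code", "lat", "lon", "place_name",
              "feature_class", "geonameid", "country_code3"]

# Map each problematic word to the values that should replace its geo fields
overrides = {
    "China": ["PCLI", 35, 105, "China", "A", 1814991, "CHN"],
    "Black Sea": ["SEA", 43, 34, "Black Sea", "H", 630673, None],
}

# Toy places table with empty geo fields (object dtype)
places = pd.DataFrame({"word": ["China", "Black Sea", "Oxford"]})
for column in geocolumns:
    places[column] = None

# Apply every override to the rows whose word matches
for word, values in overrides.items():
    places.loc[places["word"] == word, geocolumns] = values

print(places[["word", "place_name", "feature_code", "country_code3"]])
```

Keeping the corrections in one dictionary makes it easier to spot copy-paste slips, such as two different places sharing a `geonameid`.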
