# Geocoding EHRI camps to detect potential errors

The purpose of this notebook is to detect potential errors in the EHRI datasets. We retrieve the entire list of concentration camps provided by the EHRI Portal through its GraphQL API, select the camps that already contain longitude and latitude values, geocode them again from their names and alternative names using the Nominatim API, and then compare the two sets of coordinates.

First, we import the Python libraries that we are going to use.

In [1]:
import pandas as pd
import geopy
from ipyleaflet import Map, basemaps
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm
import requests
import json

We define the GraphQL query that will help us retrieve the camps from the EHRI portal (including their names, alternative names, latitude and longitude data and the list of subcamps or main camps potentially assigned to them). More on this in [this Medium post](https://towardsdatascience.com/connecting-to-a-graphql-api-using-python-246dda927840).

In [2]:
query = """query campsInfo {
  CvocVocabulary(id:"ehri_camps") {
    concepts {
      items {
        id
        description {
          name
          altLabel
        }
        latitude
        longitude
        broader {
          id
        }
        narrower {
          id
        }
      }
    }
  }
}"""

We feed the aforementioned query to the EHRI GraphQL API and assign the result of this request to the `r` variable.

In [3]:
url = 'https://portal.ehri-project-stage.eu/api/graphql'
r = requests.post(url, headers = {"X-Stream": "true"}, json={'query': query})
print(r.status_code)
print(r.text)

200
{
  "data" : {
    "CvocVocabulary" : {
      "concepts" : {
        "items" : [ {
          "id" : "ehri_camps-60",
          "description" : {
            "name" : "München-Allach (Porzellanmanufaktur) concentration camp",
            "altLabel" : [ " München-Allach (PMA) concentration camp", "München (Porzellanmanufaktur) concentration camp" ]
          },
          "latitude" : 48.1887,
          "longitude" : 11.4709,
          "broader" : [ {
            "id" : "ehri_camps-177"
          } ],
          "narrower" : [ ]
        }, {
          "id" : "ehri_camps-70",
          "description" : {
            "name" : "Concentration Camp Gusen I",
            "altLabel" : [ ]
          },
          "latitude" : null,
          "longitude" : null,
          "broader" : [ {
            "id" : "ehri_camps-570"
          } ],
          "narrower" : [ {
            "id" : "ehri_camps-2081"
          }, {
            "id" : "ehri_camps-953"
          } ]
        }, {
          "id" : "e
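A GraphQL server can answer with HTTP status 200 and still report a problem in an `errors` field, so checking the status code alone may not be enough. The following is a minimal sketch, not part of the original notebook, of a hypothetical helper `extract_camp_items` that validates the decoded payload before the items are used. The sample payload is shortened from the response above.

```python
import json


def extract_camp_items(payload):
    """Return the list of camp items from a decoded GraphQL response.

    Hypothetical helper: raises ValueError if the payload reports errors
    or does not have the shape requested by the campsInfo query.
    """
    if payload.get("errors"):
        raise ValueError(f"GraphQL errors: {payload['errors']}")
    try:
        return payload["data"]["CvocVocabulary"]["concepts"]["items"]
    except (KeyError, TypeError) as exc:
        raise ValueError("Unexpected response shape") from exc


# Shortened sample of the response shown above.
sample_text = """
{"data": {"CvocVocabulary": {"concepts": {"items": [
  {"id": "ehri_camps-60", "latitude": 48.1887, "longitude": 11.4709},
  {"id": "ehri_camps-70", "latitude": null, "longitude": null}
]}}}}
"""
items = extract_camp_items(json.loads(sample_text))
print(len(items), items[1]["latitude"])
```

In the notebook, the helper could be called as `extract_camp_items(r.json())`.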

We deserialise the EHRI Portal's GraphQL API response to a Python object so that we are able to use it with Python.

In [4]:
json_data = json.loads(r.text)

We save the data into a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), a tabular data structure similar to an Excel worksheet.

*For a deeper dive into the Python pandas library and how it compares to Excel, have a look [here](https://pandas.pydata.org/pandas-docs/dev/getting_started/comparison/comparison_with_spreadsheets.html).*

In [5]:
df_data = json_data['data']['CvocVocabulary']['concepts']['items']
df = pd.DataFrame(df_data)

Let's print the first ten rows of the DataFrame that we created (`df`) to check whether we got the intended result.

In [6]:
df.head(10)

Unnamed: 0,id,description,latitude,longitude,broader,narrower
0,ehri_camps-60,{'name': 'München-Allach (Porzellanmanufaktur)...,48.1887,11.4709,[{'id': 'ehri_camps-177'}],[]
1,ehri_camps-70,"{'name': 'Concentration Camp Gusen I', 'altLab...",,,[{'id': 'ehri_camps-570'}],"[{'id': 'ehri_camps-2081'}, {'id': 'ehri_camps..."
2,ehri_camps-2948,{'name': 'Königsberg (Neumark) concentration c...,52.9667,14.4333,[{'id': 'ehri_camps-760'}],[]
3,ehri_camps-3133,"{'name': 'Mainz-Weisenau concentration camp', ...",49.983299,8.3,[{'id': 'ehri_camps-108'}],[]
4,ehri_camps-2815,"{'name': 'Struppen concentration camp', 'altLa...",50.9333,14.016699,[],[]
5,ehri_camps-2900,{'name': 'Kostolná pri Trenčíne concentration ...,48.883333,17.972222,[],[]
6,ehri_camps-2285,"{'name': 'Smel’chintsy concentration camp', 'a...",,,[],[]
7,ehri_camps-2067,{'name': 'Blankenburg-Regenstein concentration...,,,[{'id': 'ehri_camps-627'}],[]
8,ehri_camps-2371,"{'name': 'Opatówek concentration camp', 'altLa...",,,[],[]
9,ehri_camps-3000,"{'name': 'Petseri concentration camp', 'altLab...",58.3156,26.724,[{'id': 'ehri_camps-961'}],[]


Based on the result of the previous cell, we observe that our `description` column still contains a Python dictionary, and we cannot immediately access the names and alt labels of the camps. We need to extract this information and save it into separate columns in our DataFrame.

A way to do this is by first creating some empty lists that will contain the extracted information.

In [7]:
names = []
alt_names_list = []
alt_names = []

We iterate over every camp in `df_data` (the list of items from which we built the DataFrame), appending its name to the `names` list and its alt labels to `alt_names_list`.

In [8]:
for camp in df_data:
    names.append(camp['description']['name'])
    alt_names_list.append(camp['description']['altLabel'])
    print('Added ' + camp['description']['name'])

Added München-Allach (Porzellanmanufaktur) concentration camp
Added Concentration Camp Gusen I
Added Königsberg (Neumark) concentration camp
Added Mainz-Weisenau concentration camp
Added Struppen concentration camp
Added Kostolná pri Trenčíne concentration camp
Added Smel’chintsy concentration camp
Added Blankenburg-Regenstein concentration camp
Added Opatówek concentration camp
Added Petseri concentration camp
Added Trostineţ concentration camp
Added Kleinbodungen concentration camp
Added Vivikonna OT concentration camp
Added Dobra concentration camp
Added Herzogenbusch (Continental Gummiwerke AG) concentration camp
Added Senftenberg concentration camp
Added Novovitebskoe concentration camp
Added Camugnano concentration camp
Added Septfonds concentration camp
Added Pavlograd concentration camp
Added München-Stadelheim concentration camp
Added Oleaniţa concentration camp
Added Bazzano concentration camp
Added Äänislinna II concentration camp
Added Kulupe concentration camp
Added Šilalė

Added Krośniewice concentration camp
Added Bodenwiese concentration camp
Added Trutenau concentration camp
Added Oberberg concentration camp
Added Spindlersfelde concentration camp
Added Birnbäumel concentration camp
Added München (RF-SS-Hauptkasse) concentration camp
Added Annener Gußstahlwerk concentration camp
Added Hermanów concentration camp
Added Obernigk concentration camp
Added Cebriv concentration camp
Added Danzig-Langfuhr concentration camp
Added Sulejów concentration camp
Added Janischken concentration camp
Added Adelnau concentration camp
Added Kallies concentration camp
Added Salzburg (Bombensuchkommando)
Added Dondangen II concentration camp
Added Kaweczyn concentration camp
Added Flöha concentration camp
Added Thorn (SS-Neubauleitung) concentration camp
Added Lipowa concentration camp
Added Gurs concentration camp
Added S-Gravenhage concentration camp
Added Westerbork concentration camp
Added Anhalt O/S concentration camp
Added Janowska concentration camp
Added Abitzau 

We print the first ten items of each result.

In [9]:
names[0:10]

['München-Allach (Porzellanmanufaktur) concentration camp',
 'Concentration Camp Gusen I',
 'Königsberg (Neumark) concentration camp',
 'Mainz-Weisenau concentration camp',
 'Struppen concentration camp',
 'Kostolná pri Trenčíne concentration camp',
 'Smel’chintsy concentration camp',
 'Blankenburg-Regenstein concentration camp',
 'Opatówek concentration camp',
 'Petseri concentration camp']

In [10]:
alt_names_list[0:10]

[[' München-Allach (PMA) concentration camp',
  'München (Porzellanmanufaktur) concentration camp'],
 [],
 [],
 [],
 [],
 [],
 ['Smil’chyntsi concentration camp'],
 ['Turmalin concentration camp'],
 [],
 []]

As we can see, the `names` list contains a single default name for each camp, whereas `alt_names_list` is a list of lists, since some camps have multiple alternative names. To facilitate geocoding, we iterate over every list in `alt_names_list` and join its items into one string that contains all the alternative names of a camp.

In [11]:
for labels in alt_names_list:
    # Avoid naming the variable `str`, which would shadow Python's built-in type.
    joined = ""
    for label in labels:
        joined += label + " "
    alt_names.append(joined)

In [12]:
alt_names[0:10]

[' München-Allach (PMA) concentration camp München (Porzellanmanufaktur) concentration camp ',
 '',
 '',
 '',
 '',
 '',
 'Smil’chyntsi concentration camp ',
 'Turmalin concentration camp ',
 '',
 '']

Now we can add the names and alt names as DataFrame columns for easier manipulation and visualisation of our dataset.

In [13]:
df['name'] = names
df['alt_names'] = alt_names

In [14]:
df.head()

Unnamed: 0,id,description,latitude,longitude,broader,narrower,name,alt_names
0,ehri_camps-60,{'name': 'München-Allach (Porzellanmanufaktur)...,48.1887,11.4709,[{'id': 'ehri_camps-177'}],[],München-Allach (Porzellanmanufaktur) concentra...,München-Allach (PMA) concentration camp Münch...
1,ehri_camps-70,"{'name': 'Concentration Camp Gusen I', 'altLab...",,,[{'id': 'ehri_camps-570'}],"[{'id': 'ehri_camps-2081'}, {'id': 'ehri_camps...",Concentration Camp Gusen I,
2,ehri_camps-2948,{'name': 'Königsberg (Neumark) concentration c...,52.9667,14.4333,[{'id': 'ehri_camps-760'}],[],Königsberg (Neumark) concentration camp,
3,ehri_camps-3133,"{'name': 'Mainz-Weisenau concentration camp', ...",49.983299,8.3,[{'id': 'ehri_camps-108'}],[],Mainz-Weisenau concentration camp,
4,ehri_camps-2815,"{'name': 'Struppen concentration camp', 'altLa...",50.9333,14.016699,[],[],Struppen concentration camp,
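The extraction steps above can also be sketched as a single function that flattens one API item at a time. This is only an illustrative variant: `flatten_camp` is a hypothetical helper, and the sample items are copied from the API response shown earlier.

```python
def flatten_camp(item):
    """Flatten one camp item from the API response into a plain dict."""
    description = item.get("description") or {}
    alt_labels = description.get("altLabel") or []
    return {
        "id": item["id"],
        "name": description.get("name", ""),
        # Strip stray whitespace from each label before joining.
        "alt_names": " ".join(label.strip() for label in alt_labels),
        "latitude": item.get("latitude"),
        "longitude": item.get("longitude"),
    }


sample_items = [
    {
        "id": "ehri_camps-60",
        "description": {
            "name": "München-Allach (Porzellanmanufaktur) concentration camp",
            "altLabel": [
                " München-Allach (PMA) concentration camp",
                "München (Porzellanmanufaktur) concentration camp",
            ],
        },
        "latitude": 48.1887,
        "longitude": 11.4709,
    },
    {
        "id": "ehri_camps-70",
        "description": {"name": "Concentration Camp Gusen I", "altLabel": []},
        "latitude": None,
        "longitude": None,
    },
]

rows = [flatten_camp(item) for item in sample_items]
for row in rows:
    print(row["id"], "|", row["name"], "|", row["alt_names"])
```

Passing `rows` to `pd.DataFrame` would produce the `name` and `alt_names` columns in one step. Unlike the loop used in this notebook, this variant strips the leading and trailing spaces of the alt labels.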


To facilitate geocoding based on the name of the camp, we might want to remove the term 'concentration camp' and the parentheses from the names of the camps. To do this, we use regular expressions, and then we save the result into two new `df` columns, `name_regex` and `alt_names_regex`.

In [15]:
df['name_regex'] = df['name'].replace(to_replace=r'(?i)concentration camp|[()]', value='', regex=True)
df['alt_names_regex'] = df['alt_names'].replace(to_replace=r'(?i)concentration camp|[()]', value='', regex=True)

In [16]:
df.head()

Unnamed: 0,id,description,latitude,longitude,broader,narrower,name,alt_names,name_regex,alt_names_regex
0,ehri_camps-60,{'name': 'München-Allach (Porzellanmanufaktur)...,48.1887,11.4709,[{'id': 'ehri_camps-177'}],[],München-Allach (Porzellanmanufaktur) concentra...,München-Allach (PMA) concentration camp Münch...,München-Allach Porzellanmanufaktur,München-Allach PMA München Porzellanmanufakt...
1,ehri_camps-70,"{'name': 'Concentration Camp Gusen I', 'altLab...",,,[{'id': 'ehri_camps-570'}],"[{'id': 'ehri_camps-2081'}, {'id': 'ehri_camps...",Concentration Camp Gusen I,,Gusen I,
2,ehri_camps-2948,{'name': 'Königsberg (Neumark) concentration c...,52.9667,14.4333,[{'id': 'ehri_camps-760'}],[],Königsberg (Neumark) concentration camp,,Königsberg Neumark,
3,ehri_camps-3133,"{'name': 'Mainz-Weisenau concentration camp', ...",49.983299,8.3,[{'id': 'ehri_camps-108'}],[],Mainz-Weisenau concentration camp,,Mainz-Weisenau,
4,ehri_camps-2815,"{'name': 'Struppen concentration camp', 'altLa...",50.9333,14.016699,[],[],Struppen concentration camp,,Struppen,
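To see what the regular expression does to a single value, here is a small sketch using Python's `re` module directly. The pattern is the one from the cell above; the extra whitespace clean-up in the hypothetical `clean_name` helper is an addition that the notebook itself does not perform.

```python
import re

# Same pattern as in the cell above.
CAMP_PATTERN = re.compile(r"(?i)concentration camp|[()]")


def clean_name(name):
    """Remove 'concentration camp' and parentheses, then tidy whitespace."""
    without_term = CAMP_PATTERN.sub("", name)
    return re.sub(r"\s+", " ", without_term).strip()


raw = "Königsberg (Neumark) concentration camp"
print(repr(CAMP_PATTERN.sub("", raw)))  # pattern only: whitespace is left behind
print(repr(clean_name(raw)))            # pattern plus whitespace clean-up
```

If the tidier form were preferred, it could be applied to the DataFrame with `df['name'].apply(clean_name)`.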


To geocode the camps, we need to construct a query to feed to the geocoder. The only information returned by the EHRI GraphQL API that we can use for this purpose is the names and alt names of the camps, since our datasets do not contain additional geographical information such as the cities or countries where the camps were located. Because camp names do not form complete addresses, the geocoder is unlikely to return precise results for every camp, so we need queries that are as helpful to the geocoder as possible. One way to improve the success rate is to add the alt names to the default name of each camp. Another, aimed at the precision of the retrieved locations, is to add the phrase 'concentration camp' to the query in case OpenStreetMap holds an entry for the camp itself.

We add two more columns to our df for each query we want to try out.

In [17]:
df['query'] = df['name_regex'] + df['alt_names_regex']
df['query_with_cc'] = df['name_regex'] + df['alt_names_regex'] + ' concentration camp'

In [18]:
df.head()

Unnamed: 0,id,description,latitude,longitude,broader,narrower,name,alt_names,name_regex,alt_names_regex,query,query_with_cc
0,ehri_camps-60,{'name': 'München-Allach (Porzellanmanufaktur)...,48.1887,11.4709,[{'id': 'ehri_camps-177'}],[],München-Allach (Porzellanmanufaktur) concentra...,München-Allach (PMA) concentration camp Münch...,München-Allach Porzellanmanufaktur,München-Allach PMA München Porzellanmanufakt...,München-Allach Porzellanmanufaktur München-Al...,München-Allach Porzellanmanufaktur München-Al...
1,ehri_camps-70,"{'name': 'Concentration Camp Gusen I', 'altLab...",,,[{'id': 'ehri_camps-570'}],"[{'id': 'ehri_camps-2081'}, {'id': 'ehri_camps...",Concentration Camp Gusen I,,Gusen I,,Gusen I,Gusen I concentration camp
2,ehri_camps-2948,{'name': 'Königsberg (Neumark) concentration c...,52.9667,14.4333,[{'id': 'ehri_camps-760'}],[],Königsberg (Neumark) concentration camp,,Königsberg Neumark,,Königsberg Neumark,Königsberg Neumark concentration camp
3,ehri_camps-3133,"{'name': 'Mainz-Weisenau concentration camp', ...",49.983299,8.3,[{'id': 'ehri_camps-108'}],[],Mainz-Weisenau concentration camp,,Mainz-Weisenau,,Mainz-Weisenau,Mainz-Weisenau concentration camp
4,ehri_camps-2815,"{'name': 'Struppen concentration camp', 'altLa...",50.9333,14.016699,[],[],Struppen concentration camp,,Struppen,,Struppen,Struppen concentration camp
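Concatenating the columns with `+` works here because the regular expression usually leaves a space behind where 'concentration camp' was removed. For a name that does not end in that phrase (some entries printed earlier, such as 'Šilalė', do not), the name and the alt names could end up fused. The sketch below shows a hypothetical `build_query` helper that joins the parts explicitly; the alt label in the example is made up for illustration.

```python
def build_query(name, alt_names="", suffix=""):
    """Join the non-empty parts with single spaces (hypothetical helper)."""
    parts = [name, alt_names, suffix]
    words = " ".join(part for part in parts if part).split()
    return " ".join(words)


# Plain concatenation fuses the two parts when no space is left between them.
print("Šilalė" + "Example alt name")
# Joining explicitly keeps them apart; 'Example alt name' is a made-up label.
print(build_query("Šilalė", "Example alt name"))
print(build_query(" Gusen I", "", "concentration camp"))
```

The helper also collapses repeated spaces, so the resulting queries contain no leading, trailing or doubled whitespace.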


The task of this notebook is to geocode camps for which we already have longitude and latitude data, to crosscheck whether the geospatial data currently offered through the EHRI portal is trustworthy or whether the locations of certain camps need to be validated. The strategy is to geocode as many camps as possible and then compare each geocoding result with the existing lat/long data to highlight large discrepancies. If a camp's existing coordinates are more than 5 km from the coordinates retrieved through geocoding, we add the camp to the list of camps whose locations need to be checked again.

To do this, first, we create a DataFrame containing the camps that already have their `latitude` and `longitude` data fields filled. We also create a DataFrame of the camps for which we do not currently hold any geodata (for future use).

In [19]:
camps_without_geodata = df[df.latitude.isnull()]

In [20]:
camps_without_geodata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1343 entries, 1 to 3072
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1343 non-null   object 
 1   description      1343 non-null   object 
 2   latitude         0 non-null      float64
 3   longitude        0 non-null      float64
 4   broader          1343 non-null   object 
 5   narrower         1343 non-null   object 
 6   name             1343 non-null   object 
 7   alt_names        1343 non-null   object 
 8   name_regex       1343 non-null   object 
 9   alt_names_regex  1343 non-null   object 
 10  query            1343 non-null   object 
 11  query_with_cc    1343 non-null   object 
dtypes: float64(2), object(10)
memory usage: 136.4+ KB


In [21]:
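# .copy() gives an independent DataFrame, so columns can be added to it later
# without triggering pandas' SettingWithCopyWarning.
camps_with_geodata = df[df.latitude.notnull()].copy()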

In [22]:
camps_with_geodata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1731 entries, 0 to 3073
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1731 non-null   object 
 1   description      1731 non-null   object 
 2   latitude         1731 non-null   float64
 3   longitude        1731 non-null   float64
 4   broader          1731 non-null   object 
 5   narrower         1731 non-null   object 
 6   name             1731 non-null   object 
 7   alt_names        1731 non-null   object 
 8   name_regex       1731 non-null   object 
 9   alt_names_regex  1731 non-null   object 
 10  query            1731 non-null   object 
 11  query_with_cc    1731 non-null   object 
dtypes: float64(2), object(10)
memory usage: 175.8+ KB


We will use the [GeoPy](https://geopy.readthedocs.io/en/stable/#nominatim) library and the Nominatim API to geocode the camps. Nominatim lets us search through OpenStreetMap data by names and addresses (see the [Nominatim documentation](https://nominatim.org/release-docs/develop/) for more information). To improve our results and avoid getting coordinates in places where we already know there were no concentration camps (Japan, China, Canada, etc.), we also set the `viewbox` parameter, together with `bounded=True`, so that our geolocator only considers the areas that are of interest to us.

In [23]:
geolocator = Nominatim(user_agent="geocoding_EHRI_camps")
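To make the effect of the bounding box more concrete, the sketch below checks whether a point falls inside the rectangle spanned by the two `viewbox` corners passed to the geocoder further down in this notebook. It assumes the corners are read as (latitude, longitude) pairs; `in_viewbox` is a hypothetical helper for sanity-checking results, not part of GeoPy.

```python
# Corners of the viewbox used in the geocoding cell, read as (latitude, longitude).
VIEWBOX = [(68.536656, -21.104075), (-1.418345, 49.214965)]


def in_viewbox(lat, lon, viewbox=VIEWBOX):
    """Return True if the point lies inside the rectangle spanned by the corners."""
    (lat_a, lon_a), (lat_b, lon_b) = viewbox
    return (min(lat_a, lat_b) <= lat <= max(lat_a, lat_b)
            and min(lon_a, lon_b) <= lon <= max(lon_a, lon_b))


print(in_viewbox(48.1887, 11.4709))   # München-Allach, from the EHRI data
print(in_viewbox(35.68, 139.69))      # approximate coordinates of Tokyo
```

A check like this could also be run on the existing EHRI coordinates to spot camps whose prefilled location lies outside the area the geocoder is allowed to search.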

The following cell takes a long time to run (usually around 30 minutes), so please run it only if necessary. The results of a previous run are provided with this notebook as a saved pickle file, which you can import further down in the notebook.

In [24]:
# geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# tqdm.pandas()

# camps_geolocate = camps_with_geodata['query'].progress_apply(geocode,viewbox=[(68.536656, -21.104075),(-1.418345, 49.214965)], bounded=True)


In [25]:
# camps_geolocate

In [26]:
# camps_with_geodata['query_camps_point'] = camps_geolocate.apply(lambda loc: tuple(loc.point) if loc else None)

In [27]:
# camps_with_geodata.info()
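Since each request is rate-limited to one per second, one possible way to shorten a run is to geocode each distinct query only once. The sketch below illustrates the idea with a stand-in geocoder so that it runs without network access; `geocode_unique`, `FakeLocation` and `fake_geocode` are hypothetical names, and whether the EHRI queries actually contain duplicates has not been checked here.

```python
def geocode_unique(queries, geocode):
    """Geocode each distinct query once and return one point per input query.

    `geocode` is any callable that returns an object with `latitude` and
    `longitude` attributes, or None when nothing is found.
    """
    cache = {}
    points = []
    for query in queries:
        if query not in cache:
            location = geocode(query)
            cache[query] = (location.latitude, location.longitude) if location else None
        points.append(cache[query])
    return points


# Stand-in geocoder for illustration only; it performs no network request.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


calls = []


def fake_geocode(query):
    calls.append(query)
    known = {"Agdz": FakeLocation(30.6943155, -6.4489453)}
    return known.get(query)


points = geocode_unique(["Agdz", "Unknown place", "Agdz"], fake_geocode)
print(points)
print(len(calls))
```

In the notebook, `geocode` would be the `RateLimiter`-wrapped `geolocator.geocode`. Note that this sketch stores (latitude, longitude) pairs, whereas the cells above store `tuple(loc.point)`, which also includes the altitude.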

Having geocoded our camps, we have a DataFrame that contains both the prefilled lat/long data and the geocoding results. However, the geocoder fails to retrieve information for some of the camps in our dataset and we end up with empty fields under the `query_camps_point` column. Since we cannot compare our prefilled locations with null coordinates, we remove the camps with no geocoding results and create a DataFrame that only contains camps with both coordinates.

In [28]:
# full_camps_with_regex_VIEWBOX = camps_with_geodata[camps_with_geodata['query_camps_point'].isnull() == False]
# full_camps_with_regex_VIEWBOX.head()

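We save the result in pickle format, which preserves the Python data types of each column. If the result were saved to CSV instead, the coordinate tuples would be read back as strings and would have to be converted with `ast.literal_eval`.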

In [29]:
# full_camps_with_regex_VIEWBOX.to_pickle('full_camps_with_regex_VIEWBOX.pkl')
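The CSV alternative mentioned above can be sketched as follows. The example uses only the standard library and an in-memory file, with one row whose values are taken from the Agdz entry in this dataset; it is meant to show why `ast.literal_eval` is needed, not to replace the pickle file.

```python
import ast
import csv
import io

rows = [
    {"id": "ehri_camps-2607", "query_camps_point": (30.6943155, -6.4489453, 0.0)},
]

# Write to an in-memory CSV; the tuple is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "query_camps_point"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the column is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["query_camps_point"]).__name__)

# ast.literal_eval safely parses the string back into a tuple.
restored = ast.literal_eval(loaded[0]["query_camps_point"])
print(restored)
```

With pandas, the same conversion could be applied while reading the file through the `converters` argument of `pd.read_csv`.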

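Now that the geocoding result is saved to a file, we can load it and continue working without rerunning the time-consuming geocoding step. The file provided with this notebook is read from the `data/` folder and comes from an earlier run, so a few of its column names differ from the ones used above: `names`, `names_regex` and `with_cc_query` instead of `name`, `name_regex` and `query_with_cc`.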

In [30]:
full_camps_with_regex_VIEWBOX = pd.read_pickle('data/full_camps_with_regex_VIEWBOX.pkl')

In [31]:
full_camps_with_regex_VIEWBOX.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1056 entries, 2 to 3071
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 1056 non-null   object 
 1   description        1056 non-null   object 
 2   latitude           1056 non-null   float64
 3   longitude          1056 non-null   float64
 4   broader            1056 non-null   object 
 5   narrower           1056 non-null   object 
 6   names              1056 non-null   object 
 7   alt_names          1056 non-null   object 
 8   names_regex        1056 non-null   object 
 9   alt_names_regex    1056 non-null   object 
 10  query              1056 non-null   object 
 11  with_cc_query      1056 non-null   object 
 12  query_camps_point  1056 non-null   object 
dtypes: float64(2), object(11)
memory usage: 115.5+ KB


We repeat the same process to get more geocoding results, this time using the query that additionally contains the 'concentration camp' phrase.

In [32]:
# camps_geolocate_with_cc = camps_with_geodata['query_with_cc'].progress_apply(geocode)

In [33]:
# camps_with_geodata['query_camps_point_with_cc'] = camps_geolocate_with_cc.apply(lambda loc: tuple(loc.point) if loc else None)

In [34]:
# full_camps_with_cc = camps_with_geodata[camps_with_geodata['query_camps_point_with_cc'].isnull() == False]
# full_camps_with_cc.head()

In [35]:
# full_camps_with_cc.to_pickle('full_camps_with_cc_pickle.pkl')

In [36]:
full_camps_with_cc = pd.read_pickle('data/full_camps_with_cc_pickle.pkl')

Additionally, we might want to compare the geodata in our datasets with the SS camps geodata from [Holocaust Geographies](https://holocaustgeographies.org/). This dataset has also been uploaded to EHRI's geographic repository, and we can request it from the EHRI GeoServer as a GeoJSON file.

Requesting the data as GeoJSON from GeoNode does not always succeed (it seems we need to log in again and get a new link to generate a new token). Saving the GeoJSON result as a JSON file on disk is more stable across runs.

In [37]:
# url = 'https://geonode.ehri-project-test.eu/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typename=geonode%3Ass_camps_definitive&outputFormat=json&srs=EPSG%3A4326&srsName=EPSG%3A4326'
# re = requests.get(url)
# re.json()
with open('data/sscamps.json','r') as f:
    data = json.load(f)

In [38]:
# holocaust_geo_json_data = json.loads(re.text)
holocaust_geo_json_data = data

In [39]:
hg_data = holocaust_geo_json_data['features']

In [40]:
hg_data

[{'type': 'Feature',
  'id': 'ss_camps_definitive.1759',
  'geometry': {'type': 'Point', 'coordinates': [23.974599, 55.288299]},
  'geometry_name': 'the_geom',
  'properties': {'fid': 1759,
   'ID': '10-0490-0',
   'MHG_ID': 1133,
   'SUBCAMP': 'Kedahnen',
   'MAIN': 'Kauen',
   'WOMEN': 0,
   'MEN': 0,
   'GENDER': 0,
   'FIRMS': '',
   'YYYY_OPEN': '43',
   'MM_OPEN': '12',
   'DD_OPEN': '',
   'OPEN_TXT': '"First mentioned December 1943"',
   'YYYY_CLOSE': '44',
   'MM_CLOSE': '07',
   'DD_CLOSE': '',
   'CLOSE_TXT': '"July 1944"',
   'PRISONERS': '',
   'DATE_OPEN': '1943-12-15T00:00:00Z',
   'DATE_CLOSE': '1944-07-15T00:00:00Z',
   'PEAK_POP': 300,
   'FUNC_1': 1,
   'FUNC_2': 0,
   'HOW_FOUND': '',
   'SHARE_LOC': 0,
   'LAT': 55.288299,
   'LONG': 23.974599,
   'NATIONS': '',
   'ENCY_REF': 'Encyclopedia of Camps, Volume 1 pg 856',
   'LABOR': 'Unknown',
   'EDIT_NOTES': '',
   'FIRMABBREV': None}},
 {'type': 'Feature',
  'id': 'ss_camps_definitive.1760',
  'geometry': {'type': 
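Note that GeoJSON stores point coordinates in longitude, latitude order, which is the reverse of the (latitude, longitude) order used elsewhere in this notebook. The sketch below shows a hypothetical `feature_to_row` helper that extracts the fields of interest from one feature; the sample is a shortened copy of the first feature shown above.

```python
def feature_to_row(feature):
    """Extract the fields used on the map from one GeoJSON feature."""
    lon, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order: lon, lat
    props = feature["properties"]
    return {
        "id": feature["id"],
        "main": props.get("MAIN"),
        "subcamp": props.get("SUBCAMP"),
        "latitude": lat,
        "longitude": lon,
    }


# Shortened copy of the first feature shown above.
sample_feature = {
    "type": "Feature",
    "id": "ss_camps_definitive.1759",
    "geometry": {"type": "Point", "coordinates": [23.974599, 55.288299]},
    "properties": {"SUBCAMP": "Kedahnen", "MAIN": "Kauen",
                   "LAT": 55.288299, "LONG": 23.974599},
}

row = feature_to_row(sample_feature)
print(row)
```

Applying the helper to every item of `hg_data` would give a list of flat dictionaries that could be turned into a DataFrame with `pd.DataFrame`.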

We can now visualise the information that we got onto a map that contains a different layer for each data source.

In [41]:
from ipywidgets import HTML
from ipyleaflet import Map, TileLayer, basemaps, Marker, Popup, CircleMarker, LayerGroup, LayersControl

center = [50.998235, 6.676380]
zoom = 5
m = Map(center=center, zoom=zoom)

# Create layer group
EHRI_Portal = LayerGroup(name='EHRI_Portal')
QUERY_GEOCODE = LayerGroup(name='QUERY_GEOCODE')
QUERY_WITH_CC_GEOCODE = LayerGroup(name='QUERY_WITH_CC_GEOCODE')
HG_SS_CAMPS = LayerGroup(name='HG_SS_CAMPS')


for index, row in camps_with_geodata.iterrows():
    color = '#34e912'
    circleMarker = CircleMarker(
    location=(row['latitude'],row['longitude']),
    color=color,
    weight=2
    )
#     m.add_layer(circleMarker)
    EHRI_Portal.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['name']}"

    # Popup associated to a layer
    circleMarker.popup = message
    
m.add_layer(EHRI_Portal)

for index, row in full_camps_with_regex_VIEWBOX.iterrows():
    color = '#be00e0'
    circleMarker = CircleMarker(
    location=(row['query_camps_point'][0],row['query_camps_point'][1]),
#     location=(row['geo_lat'],row['geo_long']),
    color=color,
    weight=2
    )
    QUERY_GEOCODE.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['names']}"

    # Popup associated to a layer
    circleMarker.popup = message

m.add_layer(QUERY_GEOCODE)

for index, row in full_camps_with_cc.iterrows():
    color = '#59251e'
    circleMarker = CircleMarker(
    location=(row['with_cc_query_camps_point'][0],row['with_cc_query_camps_point'][1]),
    color=color,
    weight=2
    )
    QUERY_WITH_CC_GEOCODE.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['names']}"

    # Popup associated to a layer
    circleMarker.popup = message

m.add_layer(QUERY_WITH_CC_GEOCODE)

for row in hg_data:
    color = 'blue'
    circleMarker = CircleMarker(
    location=(row['geometry']['coordinates'][1],row['geometry']['coordinates'][0]),
    color=color,
    weight=1,
    fill=False,
    dashArray=1
    )
    HG_SS_CAMPS.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['id']}<br />Main: {row['properties']['MAIN']}<br />Subcamp: {row['properties']['SUBCAMP']}"

    # Popup associated to a layer
    circleMarker.popup = message

m.add_layer(HG_SS_CAMPS)

control = LayersControl(position='topright')

m.add_control(control)

m.layout.height = '1000px'

m

Map(center=[50.998235, 6.67638], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', '…

This visualisation contains thousands of points, which makes the results very hard to analyse. Instead, we focus only on the camps whose locations we have reason to believe might be incorrect. We calculate the distance between the prefilled and the geocoded location of each camp and single out the camps where the two are more than 5 km apart. This method will return many false positives, because automated geocoding yields lower-quality data than the manually entered coordinates found in the EHRI portal.

The GeoPy library offers a `distance` function that calculates the geodesic distance between two points (see [here](https://geopy.readthedocs.io/en/stable/#module-geopy.distance)). We use it to calculate the distance between the geocoded point and the prefilled location of each camp, and we save the result in a column called `distance_from_EHRI_camp` in the `full_camps_with_regex_VIEWBOX` DataFrame.

In [42]:
from geopy import distance

# Rows are matched on their index labels; a direct lookup replaces the
# nested loop over both DataFrames.
for i, r in full_camps_with_regex_VIEWBOX.iterrows():
    if i not in camps_with_geodata.index:
        continue
    row = camps_with_geodata.loc[i]
    km = distance.distance((row["latitude"], row["longitude"]), r["query_camps_point"]).km
    full_camps_with_regex_VIEWBOX.at[i, 'distance_from_EHRI_camp'] = km
    print(km)

0.8292991387786396
1.1877615430596995
0.5032530039454259
76.5081117976914
100.9128094270784
0.025573678866701918
0.2614305602928395
0.026605501555696066
0.17766086786536364
0.7683939068368818
0.4714775273620446
0.6085734181990128
1.100757961774793
1.037428922648944
0.17000006187644678
0.15380063896938378
0.5974353222770276
0.1282633950600572
1.0207108871955657
0.5992980736974193
1.6303581977680757
0.4487985608186154
42.72327493900284
3.4487637497877133
1.1080096268227924
2.4167464199161204
1.0142505898341005
0.13833077107492223
0.35761028376279486
4582.130433794377
4.577055796595803
405.225135119001
1.059812066559108
0.9196708726774692
0.46931230092930426
0.659257689328134
0.09888733322621514
1.3734329703206787
0.13700392649198717
0.1974322897048694
1.5026787203974716
0.22191181382803296
103.78387463821899
0.7385115460984378
1283.0962698169244
0.890252316550121
0.6864447303217103
2.8408168915296312
0.38752530915884753
0.3764234157139098
0.21741672276396729
0.01537250343711567
0.4132017

0.6571658053415814
0.8648172285177976
690.612454606415
1026.3722084974904
0.8454214871561324
0.6931606474834092
0.19546199015307764
0.464268144056196
2.0039160372035973
1.636835271273391
0.8330429024869992
0.6829381198094172
1.2575763655546002
10.69174624259476
0.2384417989874463
0.1564853315354106
30.40803571611197
1.9614949438996836
0.28530576019734527
0.10102036709262742
0.5756788705524484
0.7551486955760264
300.74157441928304
0.39783517689970077
2.5673952365203414
3.804899367962307
1.053159931926226
1.1473124594961666
0.1930587602866077
2.565170221335352
0.3773729375892212
0.5805687774383888
358.1387006247745
0.2805344169702964
1688.5389705130071
0.5425831742546943
0.1850281526530881
0.5349639555270665
383.90222046960514
0.12585465605956817
339.1833610694465
3.12188448814992
0.37975258079734475
0.47677133971288743
0.698843453717447
0.43207568890491227
4.3403055906070644
9.540085349527388
1119.5502726987488
379.3838019145075
1143.7067880070747
16.54697150074452
3.0470631629505034
39

4.25596528583234
1.8015773936140198
17.10838539232167
1.963866491240727
2.4988191771938313
0.8609430786165472
139.40189691982835
5.7321790914367465
0.17960148643263663
1.3849215988796617
0.5788208134327071
0.3975326991670088
0.08157268797888846
0.7978853306628406
117.32228335688872
0.3054051608894424
404.1554791676582
1.6682041693335448
0.24232343123841563
445.6290382053911
0.6515569832253858
0.06322477401445466
0.19913962150881362
707.5700353235936
1.3158347375791468
4.948186571967049
1.2096632291816871
0.41164211530288236
2553.981263610235
0.25221595766240756
0.5869887386280545
0.6891216166705554
0.6011285130741358
0.8976085065001779
2.307933258467379
1.085395582998208
0.051221680538785216
10.401199110895949
488.55142950492467
0.9507804927838327
5.954781202200515
368.4162708354129
2.3985793174484673
2.328720404977214
0.3270813516942255
425.40732529976304
1.2707803807963496
0.3562660548656708
0.5756105623141589
3.3761990155493606
0.4207567299046595
0.32867038854238967
1.69315624519838
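As a rough cross-check of the magnitudes printed above, the distance can also be approximated with the haversine formula, which treats the Earth as a sphere. This is a swapped-in technique, not what GeoPy's `distance` function computes: GeoPy uses a geodesic on an ellipsoid, so the two results typically differ by a fraction of a percent. The function names are hypothetical, and the sample coordinates are those of the Agdz row shown further down in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, point_a[:2])
    lat2, lon2 = map(math.radians, point_b[:2])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def exceeds_threshold(ehri_point, geocoded_point, threshold_km=5):
    """Return True when the two points are more than threshold_km apart."""
    return haversine_km(ehri_point, geocoded_point) > threshold_km


# Values of the Agdz row shown further down in this notebook.
ehri = (30.693333, 6.446111)
geocoded = (30.6943155, -6.4489453, 0.0)
print(round(haversine_km(ehri, geocoded)))
print(exceeds_threshold(ehri, geocoded))
```

For a 5 km threshold, the small difference between the spherical and the ellipsoidal result is unlikely to matter, except for camps whose distance lies very close to the threshold.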

We set the distance threshold in km.

In [43]:
distance_threshold = 5 #in kilometres

We create a new DataFrame that contains the camps with potential errors based on our km distance threshold.

In [44]:
camps_with_potential_errors = full_camps_with_regex_VIEWBOX[full_camps_with_regex_VIEWBOX["distance_from_EHRI_camp"] > distance_threshold]

In [45]:
camps_with_potential_errors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259 entries, 9 to 3061
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       259 non-null    object 
 1   description              259 non-null    object 
 2   latitude                 259 non-null    float64
 3   longitude                259 non-null    float64
 4   broader                  259 non-null    object 
 5   narrower                 259 non-null    object 
 6   names                    259 non-null    object 
 7   alt_names                259 non-null    object 
 8   names_regex              259 non-null    object 
 9   alt_names_regex          259 non-null    object 
 10  query                    259 non-null    object 
 11  with_cc_query            259 non-null    object 
 12  query_camps_point        259 non-null    object 
 13  distance_from_EHRI_camp  259 non-null    float64
dtypes: float64(3), object(11)

Using the 5 km distance threshold, we can see that 259 camps need further attention to validate whether their coordinates are correct.

We visualise the result and see that our map now contains fewer points that are easier to analyse. We can further reduce this number by increasing the distance threshold or deactivating some of the layers. Perhaps the best source to consult when validating these locations is the USHMM Encyclopedia of Camps and Ghettos, which can be downloaded through this [link](https://www.ushmm.org/research/publications/encyclopedia-camps-ghettos). Although it does not contain the exact coordinates of the camps, it contains descriptions of the places, and the correct locations can be (at least approximately) deduced.

In [46]:
center = [50.998235, 6.676380]
zoom = 5
m = Map(center=center, zoom=zoom)

# Create layer group
EHRI_Portal = LayerGroup(name='EHRI_Portal')
QUERY_GEOCODE = LayerGroup(name='QUERY_GEOCODE')
QUERY_WITH_CC_GEOCODE = LayerGroup(name='QUERY_WITH_CC_GEOCODE')
HG_SS_CAMPS = LayerGroup(name='HG_SS_CAMPS')


for index, row in camps_with_potential_errors.iterrows():
    color = '#34e912'
    circleMarker = CircleMarker(
    location=(row['latitude'],row['longitude']),
    color=color,
    weight=2
    )
#     m.add_layer(circleMarker)
    EHRI_Portal.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['names']}"

    # Popup associated to a layer
    circleMarker.popup = message
    
m.add_layer(EHRI_Portal)

for index, row in camps_with_potential_errors.iterrows():
    color = '#be00e0'
    circleMarker = CircleMarker(
    location=(row['query_camps_point'][0],row['query_camps_point'][1]),
#     location=(row['geo_lat'],row['geo_long']),
    color=color,
    weight=2
    )
    QUERY_GEOCODE.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['names']}"

    # Popup associated to a layer
    circleMarker.popup = message

m.add_layer(QUERY_GEOCODE)

for index, row in full_camps_with_cc.iterrows():
    color = '#59251e'
    circleMarker = CircleMarker(
    location=(row['with_cc_query_camps_point'][0],row['with_cc_query_camps_point'][1]),
    color=color,
    weight=2
    )
    QUERY_WITH_CC_GEOCODE.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['names']}"

    # Popup associated to a layer
    circleMarker.popup = message

m.add_layer(QUERY_WITH_CC_GEOCODE)

for row in hg_data:
    color = 'blue'
    circleMarker = CircleMarker(
    location=(row['geometry']['coordinates'][1],row['geometry']['coordinates'][0]),
    color=color,
    weight=1,
    fill=False,
    dashArray=1
    )
    HG_SS_CAMPS.add_layer(circleMarker)
    message = HTML()
    message.value = f"{row['id']}<br />Main: {row['properties']['MAIN']}<br />Subcamp: {row['properties']['SUBCAMP']}"

    # Popup associated to a layer
    circleMarker.popup = message

m.add_layer(HG_SS_CAMPS)

control = LayersControl(position='topright')

m.add_control(control)

m.layout.height = '1000px'

m

Map(center=[50.998235, 6.67638], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', '…

The following cells can be used to look up a camp with potential errors by its name as it appears on the EHRI portal:

In [47]:
camp_name_on_EHRI_portal = "Agdz concentration camp"

In [48]:
camps_with_potential_errors[camps_with_potential_errors["names"]==camp_name_on_EHRI_portal]

Unnamed: 0,id,description,latitude,longitude,broader,narrower,names,alt_names,names_regex,alt_names_regex,query,with_cc_query,query_camps_point,distance_from_EHRI_camp
1165,ehri_camps-2607,"{'name': 'Agdz concentration camp', 'altLabel'...",30.693333,6.446111,[],[],Agdz concentration camp,,Agdz,,Agdz,Agdz concentration camp,"(30.6943155, -6.4489453, 0.0)",1234.768189
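In the row above, the two latitudes agree closely, while the EHRI longitude is 6.446111 and the geocoded longitude is -6.4489453. This suggests, but does not prove, that the large distance comes from a missing minus sign rather than from a wrong place. The sketch below shows one possible heuristic for spotting such cases; `possible_sign_flip` and its tolerance of 0.05 degrees are assumptions made for illustration.

```python
def possible_sign_flip(ehri_point, geocoded_point, tolerance_deg=0.05):
    """Report which coordinate, if any, matches only after flipping its sign.

    Hypothetical heuristic: returns 'latitude', 'longitude' or None.
    """
    for label, a, b in (
        ("latitude", ehri_point[0], geocoded_point[0]),
        ("longitude", ehri_point[1], geocoded_point[1]),
    ):
        differs = abs(a - b) > tolerance_deg
        matches_when_flipped = abs(a + b) <= tolerance_deg
        if differs and matches_when_flipped:
            return label
    return None


# Values copied from the Agdz row above.
print(possible_sign_flip((30.693333, 6.446111), (30.6943155, -6.4489453, 0.0)))
```

Camps flagged this way would still need to be checked manually, since the geocoder itself may have returned a wrong location.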


Finally, you can export the list of the camps with potential errors to an Excel file and analyse it further. The cell has been commented out because this file is already provided in this GitHub repository.

In [49]:
# camps_with_potential_errors.to_excel("EHRI_camps_with_potential_errors.xlsx")