# CyprusDB generation

This notebook documents every step followed to generate the CyprusDB dataset, so that the dataset can be reproduced.

## Contents

* Toponyms in Greek and Turkish
* Population data for 2011 for all settlements on the island
* Administrative affiliation
* Coordinates
* Google IDs for each settlement
* Extra information for areas not controlled by the Republic of Cyprus:
  * Total, male and female population in 2006
  * Subdistricts and municipalities


## Data sources

1) [CyStat's 2011 census](https://www.data.gov.cy/dataset/%CF%80%CE%BB%CE%B7%CE%B8%CF%85%CF%83%CE%BC%CF%8C%CF%82-%CE%BA%CE%B1%CF%84%CE%AC-%CF%84%CF%8C%CF%80%CE%BF-%CE%B4%CE%B9%CE%B1%CE%BC%CE%BF%CE%BD%CE%AE%CF%82-%CE%B1%CF%80%CE%BF%CE%B3%CF%81%CE%B1%CF%86%CE%AE-%CF%80%CE%BB%CE%B7%CE%B8%CF%85%CF%83%CE%BC%CE%BF%CF%8D-2011)
2) [Google Maps Geocoding API](https://developers.google.com/maps/documentation/geocoding/overview?hl=en-419)
3) [TRNC 2011 census](https://www.ktoeos.org/wp-content/uploads/2013/08/nufus_ikinci_.pdf)
4) [TRNC postal codes information](https://web.archive.org/web/20181024005008/http://posta.gov.ct.tr/LinkClick.aspx?fileticket=8SyyJ3rwqeI=&tabid=8099&language=en-US)
5) [OpenStreetMap](https://www.openstreetmap.org/)

## Acknowledgements

This project was supported by the [Embassy of the Republic of Cyprus in Madrid](http://www.mfa.gov.cy/mfa/Embassies/Embassy_Madrid.nsf/index_en/www.comeshipping.com.cy) and the Honorary Consulate of the Republic of Cyprus in Malaga, Spain.

# Preparation

In [1]:
### Imports ###

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
import folium

# Data retrieval
import requests
import json

# Data reading
import tabula

# Others
import time
from datetime import date
from pprint import pprint
import os
import glob
import re

# 1) Population list

Source: [CyStat](https://www.data.gov.cy/dataset/%CF%80%CE%BB%CE%B7%CE%B8%CF%85%CF%83%CE%BC%CF%8C%CF%82-%CE%BA%CE%B1%CF%84%CE%AC-%CF%84%CF%8C%CF%80%CE%BF-%CE%B4%CE%B9%CE%B1%CE%BC%CE%BF%CE%BD%CE%AE%CF%82-%CE%B1%CF%80%CE%BF%CE%B3%CF%81%CE%B1%CF%86%CE%AE-%CF%80%CE%BB%CE%B7%CE%B8%CF%85%CF%83%CE%BC%CE%BF%CF%8D-2011)

The base town list is extracted from the above link. The list is then cleaned and the population is extracted from the Excel file. The complete list of settlements used in this project is taken from the sheet `Γ2`.

In [2]:
# Open raw file
filepath = r'sources/Census 2011 Excel format/POP_CEN_11-POP_PLACE_RESID-EL-171115.xls'

# Read file
raw_census_df = pd.read_excel(filepath, sheet_name='Γ2', skiprows=3)
raw_census_df.head()

Unnamed: 0.1,Unnamed: 0,ΓΕΩΓ/ΚΟΣ ΚΩΔΙΚΟΣ,"ΕΠΑΡΧΙΑ, ΔΗΜΟΣ/ΚΟΙΝΟΤΗΤΑ ΚΑΙ ΕΝΟΡΙΑ",ΝΟΙΚΟΚΥΡΙΑ,Unnamed: 4,Unnamed: 5,Unnamed: 6,ΙΔΡΥΜΑΤΑ,Unnamed: 8,Unnamed: 9,Unnamed: 10,ΣΥΝΟΛΟ ΠΛΗΘΥΣΜΟΥ,Unnamed: 12,Unnamed: 13
0,,,,ΑΡΙΘΜΟΣ,ΠΛΗΘΥΣΜΟΣ,,,ΑΡΙΘΜΟΣ,ΠΛΗΘΥΣΜΟΣ,,,,,
1,,,,,Σύνολο,Άνδρες,Γυναίκες,,Σύνολο,Άνδρες,Γυναίκες,Σύνολο,Άνδρες,Γυναίκες
2,,,Σύνολο,303242,836566,407228,429338,211,3841,1552,2289,840407,408780,431627
3,,1.0,Επαρχία Λευκωσίας,119203,324952,157307,167645,94,2028,955,1073,326980,158262,168718
4,,1000.0,Δήμος Λευκωσίας,22833,54452,26086,28366,11,562,434,128,55014,26520,28494


## Data cleaning

In [3]:
# Retain only relevant columns
to_retain = [
    'ΓΕΩΓ/ΚΟΣ ΚΩΔΙΚΟΣ',
    'ΕΠΑΡΧΙΑ, ΔΗΜΟΣ/ΚΟΙΝΟΤΗΤΑ ΚΑΙ ΕΝΟΡΙΑ',
    'ΣΥΝΟΛΟ ΠΛΗΘΥΣΜΟΥ',
    'Unnamed: 12', # Males
    'Unnamed: 13' # Females
]

census_df = raw_census_df[to_retain].copy()

# Rename columns
to_rename = {
    'ΓΕΩΓ/ΚΟΣ ΚΩΔΙΚΟΣ': 'geo_code',
    'ΕΠΑΡΧΙΑ, ΔΗΜΟΣ/ΚΟΙΝΟΤΗΤΑ ΚΑΙ ΕΝΟΡΙΑ': 'town',
    'ΣΥΝΟΛΟ ΠΛΗΘΥΣΜΟΥ': 'population',
    'Unnamed: 12' : 'male_population',
    'Unnamed: 13' : 'female_population'
}

census_df.rename(columns=to_rename, inplace=True)

# Drop the first three rows (two sub-header rows and the overall total)
census_df.drop([0, 1, 2], inplace=True)

# Drop the last six rows, which provide no information
census_df.drop(census_df.tail(6).index, inplace=True)

# Fill NaNs with 0 in male and female population columns
census_df.fillna(value = {'male_population' : 0, 'female_population' : 0}, inplace=True)

census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population
3,1,Επαρχία Λευκωσίας,326980,158262,168718
4,1000,Δήμος Λευκωσίας,55014,26520,28494
5,100001,Άγιος Ανδρέας,5767,2817,2950
6,100002,Τρυπιώτης,2158,983,1175
7,100003,Νεμπέτ Χανέ,189,86,103


### Create the 'district' column

In [4]:
# Extract the district from the 'town' column and create a new column
# If 'Επαρχία' is in the town name, the district is the second word of that name
census_df['district'] = census_df['town'].apply(lambda x: x.split(' ')[1] if 'Επαρχία' in x else np.nan)

# Fill the NaN values with the previous value using a forward fill
# (assigning the result avoids the deprecated fillna(method=..., inplace=True) pattern)
census_df['district'] = census_df['district'].ffill()

# Remove the rows that contain the district name
census_df = census_df[~census_df['town'].str.contains('Επαρχία')]

census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district
4,1000,Δήμος Λευκωσίας,55014,26520,28494,Λευκωσίας
5,100001,Άγιος Ανδρέας,5767,2817,2950,Λευκωσίας
6,100002,Τρυπιώτης,2158,983,1175,Λευκωσίας
7,100003,Νεμπέτ Χανέ,189,86,103,Λευκωσίας
8,100004,Ταμπάκ Χανέ,299,117,182,Λευκωσίας


### Include suburbs into the main cities
This requires converting the genitive form of each dimos name back to the nominative case. Since the number of dimos entries is small, we can do this manually.


In [5]:
# Map genitive town names to nominative town names
dimos_manual_mapping = {
    # Nicosia
    'Δήμος Λευκωσίας' : 'Λευκωσία',
    'Δήμος Αγίου Δομετίου' : 'Άγιος Δομέτιος',
    'Δήμος Έγκωμης' : 'Έγκωμη',
    'Δήμος Στροβόλου' : 'Στροβόλος',
    'Δήμος Αγλαντζιάς' : 'Αγλαντζιά',
    'Δήμος Λακατάμειας' : 'Λακατάμεια',
    'Δήμος Λατσιών' : 'Λατσιά',
    'Δήμος Ιδαλίου' : 'Δάλι',
    # Larnaka
    'Δήμος Λάρνακας' : 'Λάρνακα',
    'Δήμος Αραδίππου' : 'Αραδίππου',
    # Limassol
    'Δήμος Λεμεσού' : 'Λεμεσός',
    'Δήμος Μέσα Γειτονιάς' : 'Μέσα Γειτονιά',
    'Δήμος Αγίου Αθανασίου' : 'Άγιος Αθανάσιος',
    'Δήμος Γερμασόγειας' : 'Γερμασόγεια',
    'Δήμος Κάτω Πολεμιδιών' : 'Κάτω Πολεμίδια',
    # Paphos
    'Δήμος Πάφου' : 'Πάφος',
    'Δήμος Γεροσκήπου' : 'Γεροσκήπου',
    'Δήμος Πόλεως Χρυσοχούς' : 'Πόλις Χρυσοχούς',
    'Δήμος Πέγειας' : 'Πέγεια'
}

# Create a column indicating whether the town name is a dimos
census_df['is_dimos'] = census_df['town'].isin(list(dimos_manual_mapping))

# Replace dimos names with settlement names
census_df['town'] = census_df['town'].replace(dimos_manual_mapping)

In [6]:
census_df.head(5)

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos
4,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True
5,100001,Άγιος Ανδρέας,5767,2817,2950,Λευκωσίας,False
6,100002,Τρυπιώτης,2158,983,1175,Λευκωσίας,False
7,100003,Νεμπέτ Χανέ,189,86,103,Λευκωσίας,False
8,100004,Ταμπάκ Χανέ,299,117,182,Λευκωσίας,False


## Remove suburbs from locations

This is done for several reasons:
- Have a consistent data structure: all locations are settlements, not suburbs
- Avoid double population counts

This behaviour can be switched off by setting `remove_suburbs=False`

In [7]:
remove_suburbs = True

# Remove suburbs
# The suburbs are the towns that have a six-digit geo code
if remove_suburbs:
    census_df = census_df[census_df['geo_code'].apply(lambda x: len(str(x)) < 6)].reset_index(drop=True)

In [8]:
census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True
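
One way to confirm that dropping suburbs avoids double counting is to check that suburb populations add up to the population of their parent entry. The sketch below is not part of the original pipeline: it uses invented figures, and it assumes that the first four digits of a six-digit code identify the parent, as the codes shown earlier suggest (`1000` and `100001`). `toy_df` and `check_suburb_totals` are hypothetical names.

```python
import pandas as pd

# Toy data with invented populations (not census figures)
toy_df = pd.DataFrame({
    'geo_code': [1000, 100001, 100002, 1010],
    'town': ['A', 'A1', 'A2', 'B'],
    'population': [300, 100, 200, 50],
})

def check_suburb_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Compare each parent's population with the sum of its six-digit children."""
    codes = df['geo_code'].astype(int).astype(str)
    is_suburb = codes.str.len() == 6

    # Assumption: the first four digits of a suburb code are the parent code
    suburb_sums = (
        df[is_suburb]
        .assign(parent_code=codes[is_suburb].str[:4].astype(int))
        .groupby('parent_code')['population']
        .sum()
        .rename('suburb_total')
    )

    # Keep only the parents that actually have suburbs
    parents = df[~is_suburb].merge(
        suburb_sums, left_on='geo_code', right_index=True, how='inner')
    parents['matches'] = parents['population'] == parents['suburb_total']
    return parents[['geo_code', 'town', 'population', 'suburb_total', 'matches']]

suburb_check = check_suburb_totals(toy_df)
print(suburb_check)
```

Such a check would have to run before the suburbs are removed, since it needs both the parent rows and the six-digit rows.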


In [9]:
# Set version
version = 1

In [10]:
# Save a checkpoint of the dataframe to a csv file
census_df.to_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv', index=False)

In [11]:
# Load the checkpoint
census_df = pd.read_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv')

# 2) Google Maps API (Geocoding API)

Source: [Google Maps Geocoding API](https://developers.google.com/maps/documentation/geocoding/overview?hl=en-419)

## Notes:
- The API key is stored in a file called `api_key.txt`. This file is not included in the repository. To reproduce the results here, you need to create your own API key and store it in a file called `api_key.txt` in the folder `private_utils`.
- Requests are limited to 2500 per day. 
- Retrieving data from the API is not free. Be aware of the costs when rebuilding the dataset.
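
Given the request limit and cost noted above, it may help to space out calls when rebuilding the dataset. The sketch below is one possible approach, not part of the original notebook: `throttled` and `fake_lookup` are hypothetical names, and the interval is an arbitrary value chosen for illustration.

```python
import time

def throttled(func, min_interval: float = 0.1):
    """Wrap func so that consecutive calls are at least min_interval seconds apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Stand-in for a function that would call the API (hypothetical)
def fake_lookup(town):
    return town.upper()

slow_lookup = throttled(fake_lookup, min_interval=0.05)
results = [slow_lookup(t) for t in ['a', 'b', 'c']]
print(results)
```

The same wrapper could be applied to any function that issues a request, so the pacing logic stays separate from the retrieval logic.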

In [2]:
# Read API key
with open('private_utils/api_key.txt', 'r') as f:
    # Strip whitespace so a trailing newline in the file does not end up in the request
    api_key = f.read().strip()

## Auxiliary functions

In [13]:
# Extract coordinates for towns in Cyprus from Geocoding API
def extract_coordinates(
        town: str,
        district: str = None,
        api_key: str = api_key, 
        boundaries: list = [34.51, 32.17, 35.73, 34.61]) -> tuple:
    """
    Extract coordinates for a town in Cyprus from Google Maps Geocoding API.

    The boundaries argument biases the results towards the given bounding box;
    it does not strictly restrict them to it.

    Parameters
    ----------
    town : str
        The town name.
    district : str
        The district name.
    api_key : str
        The Google Maps API key.
    boundaries : list
        The boundaries of the search area. It is a list of four floats that represent the
        boundaries of the search area. The order of the floats is as follows:
        [southwest_lat, southwest_lon, northeast_lat, northeast_lon].
        By default, the boundaries are set to the approximate boundaries of Cyprus island.

    Returns
    -------
    lat : float
        The latitude of the town.
    lon : float
        The longitude of the town.
    gm_id : str
        The Google Maps ID of the town.
    """
        
    # Set search term
    query = town

    # Add district to search term
    if district is not None:
        query += ' ' + district

    # Set Geocoding API URL and query parameters
    # (requests takes care of URL-encoding Greek characters and spaces)
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {'address': query, 'key': api_key}

    # Add boundaries to search
    if boundaries:
        params['bounds'] = f'{boundaries[0]},{boundaries[1]}|{boundaries[2]},{boundaries[3]}'

    # Request the geocoding results
    response = requests.get(url, params=params)
    data = response.json()
    
    # Extract coordinates and Google Maps ID
    lat = data['results'][0]['geometry']['location']['lat']
    lon = data['results'][0]['geometry']['location']['lng']
    gm_id = data['results'][0]['place_id']

    return lat, lon, gm_id
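
The function above indexes `data['results'][0]` directly, which raises an `IndexError` when the API returns no match. A more defensive parsing step could look like the sketch below. It assumes the response carries a `status` field alongside `results`, as in the Geocoding API's JSON format; `parse_geocoding_response` is a hypothetical helper and the payloads are hand-written.

```python
import math

def parse_geocoding_response(data: dict) -> tuple:
    """Return (lat, lon, place_id), or (nan, nan, None) if there is no usable result."""
    if data.get('status') != 'OK' or not data.get('results'):
        return math.nan, math.nan, None

    # Like the original function, only the first result is used
    first = data['results'][0]
    location = first['geometry']['location']
    return location['lat'], location['lng'], first['place_id']

# Hand-written payloads that mimic the shape of the API response
ok_payload = {
    'status': 'OK',
    'results': [{
        'geometry': {'location': {'lat': 35.185566, 'lng': 33.382276}},
        'place_id': 'ChIJVU1JymcX3hQRpcARA5ykXls',
    }],
}
empty_payload = {'status': 'ZERO_RESULTS', 'results': []}

print(parse_geocoding_response(ok_payload))
print(parse_geocoding_response(empty_payload))
```

Returning missing values instead of raising lets a whole batch finish, with the failed towns left to be reviewed afterwards.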

## Data retrieval

In [14]:
census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True


In [15]:
# Generate coordinates if requested
generate_coordinates = False

if generate_coordinates:
    # Extract coordinates for towns in Cyprus from Geocoding API
    # Takes ~ 1 minute
    # Latitude, longitude, Google Maps ID
    census_df['lat'], census_df['lon'], census_df['gm_id'] = zip(*census_df.apply(lambda x: extract_coordinates(x['town'], district=x['district']), axis=1))

    # Save coordinates with retrieval date
    census_coordinates = census_df[['town', 'district', 'lat', 'lon', 'gm_id']]
    census_coordinates.to_csv(f'sources/Geocoding API/geocoding_coordinates_ROC_{date.today().strftime("%Y-%m-%d")}.csv', index=False)

else:
    # Select the latest coordinates file
    list_of_files = glob.glob('sources/Geocoding API/geocoding_coordinates_ROC_*.csv')
    latest_file = max(list_of_files, key=os.path.getctime)
    
    # Load coordinates from file
    census_coordinates = pd.read_csv(latest_file)

# Add coordinates to census dataframe
# census_df = census_df.merge(census_coordinates, on=['town', 'district'], how='left')
census_df[['lat', 'lon', 'gm_id']] = census_coordinates[['lat', 'lon', 'gm_id']]

census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0
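
The cell above attaches the coordinates by row position, which relies on the saved file having the same row order as `census_df`. A key-based merge, along the lines of the commented-out statement, does not depend on row order. The following is a minimal sketch on two toy frames; `towns`, `coords`, `merged` and `unmatched` are hypothetical names.

```python
import pandas as pd

towns = pd.DataFrame({
    'town': ['Λευκωσία', 'Στροβόλος'],
    'district': ['Λευκωσίας', 'Λευκωσίας'],
})

# Same towns in a different row order, as a saved file might have
coords = pd.DataFrame({
    'town': ['Στροβόλος', 'Λευκωσία'],
    'district': ['Λευκωσίας', 'Λευκωσίας'],
    'lat': [35.143663, 35.185566],
    'lon': [33.343791, 33.382276],
})

merged = towns.merge(
    coords,
    on=['town', 'district'],
    how='left',
    validate='one_to_one',  # raise if a (town, district) pair is duplicated
    indicator=True,         # adds a '_merge' column showing unmatched rows
)

unmatched = merged[merged['_merge'] == 'left_only']
print(merged[['town', 'lat', 'lon']])
print(f'{len(unmatched)} towns without coordinates')
```

If two settlements in the same district share a name, `validate='one_to_one'` would raise an error, which is a useful signal that the merge keys need another column.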


## Plot and inspect results

In [16]:
# Show the retrieval results to check for errors
# Plot all towns in Cyprus
map = folium.Map(location=[35.1264, 33.4299], zoom_start=9)

for i in range(len(census_df)):
    folium.CircleMarker(
        location=[census_df['lat'][i], census_df['lon'][i]],
        radius=5,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        tooltip=f"{census_df['town'][i]}",
        parse_html=False).add_to(map)
    
map

## Results notes

1) The coordinates for the town of `Πόλις Χρυσοχούς` (Pólis Chrysochoús) are not correct. Its name conflicts with other natural features and locations, and the retrieval function only takes the first result. Since it is the only instance of this problem, a manual correction is applied.
2) The coordinates for `Αχερίτου` (Acheritou) and `Άχνα` (Achna) fall in the area not controlled by the ROC. The entities with those names in the ROC census seem to count the population of the settlements of `Άγιος Γεώργιος Αχερίτου` (Agios Georgios Acheritou) and `Δασάκι Άχνας` (Dasaki Achnas). The similar names make disambiguation difficult when retrieving coordinates through the API, so these are also corrected manually.
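
Visual inspection can be complemented with a simple automatic check. The sketch below flags points that fall outside the approximate bounding box used as the default in `extract_coordinates`. It is only a coarse filter: it would not catch a wrong location that is still on the island, such as the cases listed above. `outside_bounds` and the outlier row are hypothetical.

```python
import pandas as pd

# Same approximate bounding box used as the default in extract_coordinates
CYPRUS_BOUNDS = (34.51, 32.17, 35.73, 34.61)  # sw_lat, sw_lon, ne_lat, ne_lon

def outside_bounds(df: pd.DataFrame, bounds: tuple = CYPRUS_BOUNDS) -> pd.Series:
    """Return True for rows whose coordinates are missing or outside the box."""
    sw_lat, sw_lon, ne_lat, ne_lon = bounds
    inside = df['lat'].between(sw_lat, ne_lat) & df['lon'].between(sw_lon, ne_lon)
    return ~inside

points = pd.DataFrame({
    'town': ['Λευκωσία', 'hypothetical outlier'],
    'lat': [35.185566, 37.98],
    'lon': [33.382276, 23.73],
})
points['suspect'] = outside_bounds(points)
print(points)
```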

In [17]:
# Prepare manual corrections 
coordinates_manual_mapping = {
    'Πόλις Χρυσοχούς' : [35.0339441, 32.4253751, 'ChIJG4zmlxR05xQR7lc0vj1h-YQ'],
    'Αχερίτου' : [35.0732134, 33.871748, 'ChIJ-3bvKvzL3xQRZiFfvsvFLTM'], # Using the GM ID for Vrysoules
    'Άχνα' : [35.0728291, 33.825531, 'ChIJEfGqsrEy3hQR9S8OV6Sf2X0']
}

# Apply manual corrections
for town in coordinates_manual_mapping.keys():
    census_df.loc[census_df['town'] == town, ['lat', 'lon', 'gm_id']] = coordinates_manual_mapping[town]

## Inspect results after correction

In [18]:
map = folium.Map(location=[35.1264, 33.4299], zoom_start=9)

for i in range(len(census_df)):
    folium.CircleMarker(
        location=[census_df['lat'][i], census_df['lon'][i]],
        radius=5,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        # Add text with the town name when hovering over the marker
        tooltip=f"{census_df['town'][i]}",
        parse_html=False).add_to(map)
    
map

## Save checkpoint

In [104]:
# Set version
version = 2

In [20]:
# Save a checkpoint of the dataframe to a csv file
census_df.to_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv', index=False)

In [105]:
# Load the checkpoint
census_df = pd.read_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv')

# 3) Add non-Greek toponyms to ROC-controlled settlements

## English toponyms

In [106]:
# Define a function to extract the English name from a Google Maps entity
def extract_english_name(gm_id):
    """Given a Google Maps ID, extract the English name of the entity.
    Used to retrieve the English name of a town controlled by the ROC.
    
    Parameters
    ----------
    gm_id : str
        The Google Maps ID of the entity.
        
    Returns
    -------
    english_name : str
        The English name of the entity."""

    # Set Geocoding API URL
    url = 'https://maps.googleapis.com/maps/api/geocode/json?language=EN&key=' + api_key + '&place_id=' + gm_id

    # Retrieve the entity
    response = requests.get(url)
    entity = json.loads(response.text)

    # Extract the English name
    english_name = entity['results'][0]['address_components'][0]['long_name']
    print(f'English name for {gm_id} is {english_name}')
    
    return english_name

In [107]:
generate_names = False

if generate_names:
    # Apply the function to the dataframe
    # pd.notna also catches NaN values that are not the np.nan object itself
    census_df['english_name'] = census_df['gm_id'].apply(lambda x: extract_english_name(x) if pd.notna(x) else np.nan)

    # Extract a table with the town names, the Google IDs and the English names
    english_names_df = census_df[['town', 'gm_id', 'english_name']]
    english_names_df.to_csv(f'sources/Geocoding API/geocoding_english_toponyms__{date.today().strftime("%Y-%m-%d")}.csv', index=False)

else: 
    # Select the latest version of the English names table
    list_of_files = glob.glob('sources/Geocoding API/geocoding_english_toponyms_*.csv')
    latest_file = max(list_of_files, key=os.path.getctime)

    # Load the English names from file
    english_names_df = pd.read_csv(latest_file)

    # Add the English names to the census dataframe
    census_df = census_df.merge(english_names_df, on=['town', 'gm_id'], how='left')

# Show the dataframe
census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia


In [108]:
census_df

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia
5,1021,Λακατάμεια,38345,18674,19671,Λευκωσίας,True,35.118566,33.314538,ChIJvfpe0pgb3hQRzVO7VGoRPls,Lakatamia
6,1022,Συνοικισμός Ανθούπολης,1756,767,989,Λευκωσίας,False,35.113951,33.289679,ChIJB3reYWMb3hQRdCFNyi8yBbM,Anthoupolis
7,1023,Λατσιά,16774,8132,8642,Λευκωσίας,True,35.106364,33.378267,ChIJbbBNlTcZ3hQRO3tmqqvDS3E,Latsia
8,1024,Γέρι,8235,4065,4170,Λευκωσίας,False,35.10757,33.422429,ChIJea8LrGAY3hQRBtiaqfqbnPM,Geri
9,1100,Σιά,754,368,386,Λευκωσίας,False,34.955289,33.389474,ChIJyaqxJB2h4BQR4xR_8HkompQ,Sha


In [111]:
## Apply manual corrections
english_names_manual_mapping = {
    'Σιά' : 'Sia',
    'Καταλιόντας' : 'Kataliontas',
    'Λουρουκίνα' : 'Louroukina',
    'Παλαιχώρι Ορεινής' : 'Palaichori Oreinis',
    'Πάνω Ζώδεια' : 'Pano Zodeia'}

# Apply manual corrections
for town in english_names_manual_mapping.keys():
    census_df.loc[census_df['town'] == town, 'english_name'] = english_names_manual_mapping[town]

## Turkish toponyms

A significant portion of the settlements controlled by the ROC have Turkish names. These are added to the dataset based on information from the Geocoding API.

In [28]:
# Define a function to extract the Turkish name from a Google Maps entity
def extract_turkish_name(gm_id):
    """Given a Google Maps ID, extract the Turkish name of the entity.
    Used to retrieve the Turkish name of a town controlled by the ROC.
    
    Parameters
    ----------
    gm_id : str
        The Google Maps ID of the entity.
        
    Returns
    -------
    turkish_name : str
        The Turkish name of the entity."""

    # Set Geocoding API URL
    url = 'https://maps.googleapis.com/maps/api/geocode/json?language=TR&key=' + api_key + '&place_id=' + gm_id

    # Retrieve the entity
    response = requests.get(url)
    entity = json.loads(response.text)

    # Extract the Turkish name
    turkish_name = entity['results'][0]['address_components'][0]['long_name']
    print(f'Turkish name for {gm_id} is {turkish_name}')
    
    return turkish_name

In [112]:
generate_names = False

if generate_names:
    # Apply the function to the dataframe
    # pd.notna also catches NaN values that are not the np.nan object itself
    census_df['turkish_name'] = census_df['gm_id'].apply(lambda x: extract_turkish_name(x) if pd.notna(x) else np.nan)

    # Extract a table with the town names, the Google IDs and the Turkish names
    turkish_names_df = census_df[['town', 'gm_id', 'turkish_name']]
    turkish_names_df.to_csv(f'sources/Geocoding API/geocoding_turkish_toponyms__{date.today().strftime("%Y-%m-%d")}.csv', index=False)

else:
    # Select the latest version of the Turkish names table
    list_of_files = glob.glob('sources/Geocoding API/geocoding_turkish_toponyms_*.csv')
    latest_file = max(list_of_files, key=os.path.getctime)

    # Load the Turkish names from file
    turkish_names_df = pd.read_csv(latest_file)

    # Add the Turkish names to the census dataframe
    census_df = census_df.merge(turkish_names_df, on=['town', 'gm_id'], how='left')

# Show the dataframe
census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name,turkish_name
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia,Lefkoşa
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios,Aydemet
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi,İncirli
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos,Strovolos
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia,Atalassa


In [113]:
census_df

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name,turkish_name
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia,Lefkoşa
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios,Aydemet
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi,İncirli
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos,Strovolos
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia,Atalassa
5,1021,Λακατάμεια,38345,18674,19671,Λευκωσίας,True,35.118566,33.314538,ChIJvfpe0pgb3hQRzVO7VGoRPls,Lakatamia,Lakadamya
6,1022,Συνοικισμός Ανθούπολης,1756,767,989,Λευκωσίας,False,35.113951,33.289679,ChIJB3reYWMb3hQRdCFNyi8yBbM,Anthoupolis,Anthoupolis
7,1023,Λατσιά,16774,8132,8642,Λευκωσίας,True,35.106364,33.378267,ChIJbbBNlTcZ3hQRO3tmqqvDS3E,Latsia,Uluyurt
8,1024,Γέρι,8235,4065,4170,Λευκωσίας,False,35.10757,33.422429,ChIJea8LrGAY3hQRBtiaqfqbnPM,Geri,Yeri
9,1100,Σιά,754,368,386,Λευκωσίας,False,34.955289,33.389474,ChIJyaqxJB2h4BQR4xR_8HkompQ,Sia,Sha
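
In the output above, some rows carry the same string in `english_name` and `turkish_name` (for example Strovolos and Anthoupolis). This may mean that no distinct Turkish name is available for that entity, although that is an interpretation rather than something the API states. A short sketch for listing such rows for review, using a few values copied from the table; `sample` and `to_review` are hypothetical names.

```python
import pandas as pd

sample = pd.DataFrame({
    'town': ['Λευκωσία', 'Στροβόλος', 'Συνοικισμός Ανθούπολης', 'Γέρι'],
    'english_name': ['Nicosia', 'Strovolos', 'Anthoupolis', 'Geri'],
    'turkish_name': ['Lefkoşa', 'Strovolos', 'Anthoupolis', 'Yeri'],
})

# Compare case-insensitively so that capitalisation differences are ignored
same_name = sample['english_name'].str.casefold() == sample['turkish_name'].str.casefold()
to_review = sample.loc[same_name, 'town'].tolist()
print(to_review)
```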


## Inspect and correct results

In [114]:
# Set manual name corrections
turkish_name_corrections = {
    'Καταλιόντας' : 'Kataliontas',
}

# Apply manual corrections
for town, turkish_name in turkish_name_corrections.items():
    census_df.loc[census_df['town'] == town, 'turkish_name'] = turkish_name

## Save checkpoint

In [115]:
# Set version
version = 3

In [116]:
# Save a checkpoint of the dataframe to a csv file
census_df.to_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv', index=False)

In [117]:
# Load the checkpoint
census_df = pd.read_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv')

# 4) Remove location words and other additions in names

Location words describe the administrative unit a town belongs to. They usually distinguish a town from another with the same name elsewhere in the country, and take the form of the genitive of a larger settlement or area. However, they can make it harder to analyze the town names themselves, so they are removed.

In [118]:
# Define a list with the genitive forms of the districts
genitive_districts = census_df['district'].unique().tolist()
genitive_districts

# Add other genitive forms used in the census
genitive_others = [
    'Κελοκεδάρων',
    'Χρυσοχούς',
    ' Αυδήμου', # Leading space so that only the suffix is removed and a town named just Αυδήμου is left intact
    'Καυκάλλου'
    ]
all_genitives = genitive_districts + genitive_others

# Remove the genitive forms from town names
# (regex=False: the genitives are matched as literal text)
for genitive in all_genitives:
    census_df['town'] = census_df['town'].str.replace(genitive, '', regex=False)

# Show the dataframe
census_df.head()


Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name,turkish_name
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia,Lefkoşa
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios,Aydemet
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi,İncirli
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos,Strovolos
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia,Atalassa


In [119]:
# Remove parentheses and other extra information in the town names
parenthesis_regex = re.compile(r'\([^)]*\)') # Matches anything between parentheses

# Apply the regex to the town names
census_df['town'] = census_df['town'].str.replace(parenthesis_regex, '', regex=True)

# Show the dataframe
census_df.head()

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name,turkish_name
0,1000,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia,Lefkoşa
1,1010,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios,Aydemet
2,1011,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi,İncirli
3,1012,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos,Strovolos
4,1013,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia,Atalassa
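
Removing a location word or a parenthesised note can leave stray whitespace at the end of a name. The sketch below walks through both removals on three names that appear in this notebook and then tidies the whitespace. The final trimming step is an addition for illustration, not something the cells above do.

```python
import pandas as pd

names = pd.Series(['Πόλις Χρυσοχούς', 'KARAMAN (YUKARI KARMİ)', 'Λευκωσία'])

# Literal removal of a location word (regex=False treats the pattern as plain text)
cleaned = names.str.replace('Χρυσοχούς', '', regex=False)

# Regex removal of parenthesised text (regex=True is required for a pattern)
cleaned = cleaned.str.replace(r'\([^)]*\)', '', regex=True)

# Collapse repeated spaces and trim both ends
cleaned = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned.tolist())
```

Trimming matters for later joins on the town name, since `'Πόλις '` and `'Πόλις'` are different strings.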


As a special case, the town of `Ορόκληνη` (Oroklini) is listed as `Bορόκληνη` (Voroklini). To keep it aligned with the usual name, we replace it with `Ορόκληνη`.

In [120]:
# Replace Voroklini with Oroklini
census_df['town'] = census_df['town'].str.replace('Bορόκληνη', 'Ορόκληνη ')

## Save checkpoint

In [121]:
# Set version
version = 4

In [122]:
# Save a checkpoint of the dataframe to a csv file
census_df.to_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv', index=False)

In [123]:
# Load the checkpoint
census_df = pd.read_csv(f'checkpoints/CyprusDB_cp_v{str(version)}.csv')

# 5) Add census data for TRNC-controlled areas

Source: [2011 Census of TRNC-controlled areas](https://www.ktoeos.org/wp-content/uploads/2013/08/nufus_ikinci_.pdf)

The data is extracted from the above PDF. The target table is Table 5, which contains population information for each town for the years 2006 and 2011. The data is on pages 25 to 33.

## Read pages

In [231]:
# Read each individual page from the PDF file
# The tables are in pages 25 to 33
pages = tabula.read_pdf('sources/TRNC Census 2011/nufus_ikinci_.pdf', pages='25-33')

# Delete the first three rows from every page
# They contain general information
for page in pages:
    page.drop(page.index[:3], inplace=True)

## Clean pages

In [232]:
# Concatenate all pages into a single dataframe
census_df_trnc = pd.concat(pages, ignore_index=True)

# Rename columns
# 'mixed_data' contains several columns merged together by the PDF table extraction
column_names = ['town', 'population_2006', 'mixed_data', 'male_population', 'female_population']
census_df_trnc.columns = column_names

# Separate the columns in 'mixed_data'
new_columns = ['male_population_2006', 'female_population_2006', 'population']
census_df_trnc[new_columns] = census_df_trnc['mixed_data'].str.split(' ', expand=True)

# Reorder columns
order = ['town', 
         'population', 'male_population', 'female_population', 
         'population_2006', 'male_population_2006', 'female_population_2006']
census_df_trnc = census_df_trnc[order]

# Remove rows where the town is not specified
# Those rows are headers
census_df_trnc = census_df_trnc[census_df_trnc['town'].notna()]

# Convert selected columns to numeric
numeric_columns = [
    'population', 'male_population', 'female_population', 
    'population_2006', 'male_population_2006', 'female_population_2006']
for column in numeric_columns:
    # Remove ',' in numbers
    census_df_trnc[column] = census_df_trnc[column].str.replace(',', '').astype(int)

# Show the first rows
census_df_trnc.head()

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006
1,Lefkoşa İlçe Toplamı,94824,49838,44986,84776,46187,38589
2,Lefkoşa Merkez Bucak Toplamı,82929,43628,39301,72479,39337,33142
3,Lefkoşa Belediye Toplamı,61378,32260,29118,56146,30583,25563
4,ABDİ ÇAVUŞ,568,315,253,975,591,384
5,AKKAVUK,793,458,335,898,498,400
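
Because three values were packed into `mixed_data`, it is worth checking that the split produced what was expected. The sketch below rebuilds one row from the output above and verifies that the male and female figures add up to the totals for both years. The raw string formatting (comma thousands separators, single spaces) is an assumption reconstructed from the cleaning code, and `raw` is a hypothetical frame.

```python
import pandas as pd

raw = pd.DataFrame({
    'town': ['Lefkoşa İlçe Toplamı'],
    'population_2006': ['84,776'],
    'mixed_data': ['46,187 38,589 94,824'],
    'male_population': ['49,838'],
    'female_population': ['44,986'],
})

# Split the merged column and make sure it yields exactly three parts
parts = raw['mixed_data'].str.split(' ', expand=True)
assert parts.shape[1] == 3, 'expected exactly three values in mixed_data'
raw[['male_population_2006', 'female_population_2006', 'population']] = parts

# Convert to numbers; values that cannot be parsed become NaN instead of raising
numeric_columns = [c for c in raw.columns if c not in ('town', 'mixed_data')]
for column in numeric_columns:
    raw[column] = pd.to_numeric(
        raw[column].str.replace(',', '', regex=False), errors='coerce')

consistent_2011 = raw['male_population'] + raw['female_population'] == raw['population']
consistent_2006 = (raw['male_population_2006'] + raw['female_population_2006']
                   == raw['population_2006'])
print(consistent_2011.all(), consistent_2006.all())
```

Rows failing either check would point to a page where the extraction shifted the columns.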


## Extract districts, subdistricts and municipalities

In [233]:
# Create a column named 'district'
# If 'İlçe' ('district') is in the town name, extract it
census_df_trnc['district'] = census_df_trnc['town'].apply(lambda x: x.split(' İlçe ')[0] if ' İlçe' in x else np.nan)

# Apply a forward fill to fill the missing values
census_df_trnc['district'] = census_df_trnc['district'].ffill()

In [234]:
# Create a column for subdistricts
# If 'Bucak' ('subdistrict') is in the town name, extract it
census_df_trnc['subdistrict'] = census_df_trnc['town'].apply(lambda x: x.split(' ')[0] if ' Bucak' in x else np.nan)

# Apply a forward fill to fill the missing values
census_df_trnc['subdistrict'] = census_df_trnc['subdistrict'].ffill() 

In [235]:
# Create a column for municipalities
# If 'Belediye' ('municipality') is in the town name, extract it
census_df_trnc['municipality'] = census_df_trnc['town'].apply(lambda x: x.split(' ')[0] if ' Belediye' in x else np.nan)

# Apply a forward fill to fill the missing values
census_df_trnc['municipality'] = census_df_trnc['municipality'].ffill()
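
The three cells above share one pattern: mark the total rows, extract a name from them, and forward-fill it to the rows below. The sketch below expresses that pattern in a single loop on the rows shown in the earlier output. It differs from the cells above in one respect: it captures everything before the keyword, so the subdistrict becomes `Lefkoşa Merkez` rather than only the first word. Whether that is preferable depends on how the column is used later; `rows` and `levels` are hypothetical names.

```python
import pandas as pd

rows = pd.DataFrame({'town': [
    'Lefkoşa İlçe Toplamı',
    'Lefkoşa Merkez Bucak Toplamı',
    'Lefkoşa Belediye Toplamı',
    'ABDİ ÇAVUŞ',
    'AKKAVUK',
]})

levels = {'district': 'İlçe', 'subdistrict': 'Bucak', 'municipality': 'Belediye'}

for column, keyword in levels.items():
    # Capture everything before the keyword; rows without it yield NaN
    rows[column] = rows['town'].str.extract(rf'^(.*?)\s+{keyword}\b', expand=False)
    # Propagate the last seen value down to the settlement rows
    rows[column] = rows[column].ffill()

print(rows)
```

One caveat of forward filling applies to both versions: a value keeps propagating until the next total row of the same level, so settlements that follow a municipality block without belonging to it would still inherit its name.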

## Final cleaning

In [236]:
# Remove rows including 'Toplamı' ('Total')
census_df_trnc = census_df_trnc.loc[~census_df_trnc['town'].str.contains('Toplamı')]

Certain towns require name corrections in order to be processed by the Geocoding API in future steps. Otherwise, the coordinates are not retrieved.

* **KAPALI MARAŞ** ("Closed Maraş"): MARAŞ
* **MALATYA - İNCESU**: İNCESU
* **KARAMAN (YUKARI KARMİ)** ("Upper Karmi") : KARAMAN 

In [237]:
# Set towns to correct
town_corrections = {
    # Raw census name : Corrected name
    'KAPALI MARAŞ' : 'MARAŞ',
    'MALATYA - İNCESU' : 'İNCESU',
    'KARAMAN (YUKARI KARMİ)' : 'KARAMAN'
}

# Apply corrections
for old_name, new_name in town_corrections.items():
    census_df_trnc.loc[census_df_trnc['town'] == old_name, 'town'] = new_name

## Save checkpoint

The checkpoint is saved in a secondary folder before merging with the main dataset.

In [238]:
# Set version
version = 1

In [239]:
# Save a checkpoint of the dataframe to a csv file
census_df_trnc.to_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv', index=False)

In [240]:
# Load the checkpoint
census_df_trnc = pd.read_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv')

## Read pages

In [241]:
# Read each individual page from the PDF file
postal_codes_pages = tabula.read_pdf('sources/Postal codes TRNC/KKTC Posta Kodlari.pdf', pages = 'all')

# Concatenate all pages into a single dataframe
raw_postal_codes_df = pd.concat(postal_codes_pages, ignore_index=True)

## Clean pages

In [242]:
# Create a copy of the dataframe
postal_codes_df = raw_postal_codes_df.copy()

# Drop columns containing 'İLÇESİ'
postal_codes_df.drop(columns=postal_codes_df.columns[postal_codes_df.columns.str.contains('İLÇESİ')], inplace=True)

# Drop the first column
postal_codes_df.drop(columns=postal_codes_df.columns[0], inplace=True)

# Rename columns
column_renames = {
    'Unnamed: 0': 'district',
    'Unnamed: 1': 'subdistrict',
    'Unnamed: 2': 'entity',
    'Unnamed: 3': 'postal_code'
}
postal_codes_df.rename(columns=column_renames, inplace=True)

# Remove rows where the district is not specified
# Those rows are headers
postal_codes_df = postal_codes_df[postal_codes_df['district'].notna()]

# Remove rows where the postal code is 'PK'
# Those rows are headers
postal_codes_df = postal_codes_df[postal_codes_df['postal_code'] != 'PK']

# Replace GAZİ.MAĞUSA with GAZİMAĞUSA
postal_codes_df['subdistrict'] = postal_codes_df['subdistrict'].str.replace('GAZİ.MAĞUSA', 'GAZİMAĞUSA', regex = False)

# Replace G.MAĞUSA with GAZİMAĞUSA in the district column
postal_codes_df['district'] = postal_codes_df['district'].str.replace('G.MAĞUSA', 'GAZİMAĞUSA', regex = False)

## Homogenize district names

Although visually identical, the district names in the two datasets differ at the character level: they are made up of different Unicode code points. To simplify the merging process, a manual mapping between the two datasets is required.
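
As a minimal illustration of why a plain lowercase comparison fails, the sketch below inspects the code points of one district name. It assumes that the census spells the name with a plain 'i', and it writes the names with escape sequences so that the exact characters are unambiguous.

```python
import unicodedata

census_name = 'Gazima\u011fusa'        # 'Gazimağusa', assumed census spelling
postal_name = 'GAZ\u0130MA\u011eUSA'   # 'GAZİMAĞUSA', as in the postal codes

lowered = postal_name.lower()

# 'İ' (U+0130) lowercases to 'i' followed by a combining dot above (U+0307),
# so the lowercased string is one code point longer
print(len(census_name), len(lowered))
print([unicodedata.name(c) for c in lowered[:5]])

# The strings look alike on screen but do not compare equal
print(census_name.lower() == lowered)
```

An explicit dictionary, as used in the next cell, sidesteps the issue without any Unicode normalization.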

In [243]:
# Show differences
# display([ord(c) for c in 'Gazimağusa'], 
#         [ord(c) for c in 'GAZİMAĞUSA'.lower()])

# Create a dictionary to map the district names
district_ascii_mapppings = {
    'LEFKOŞA' : 'Lefkoşa',
    'GİRNE' : 'Girne',
    'GAZİMAĞUSA' : 'Gazimağusa',
    'GÜZELYURT' : 'Güzelyurt',
    'İSKELE' : 'İskele'
}

# Map the district names
postal_codes_df['district'] = postal_codes_df['district'].map(district_ascii_mapppings)

In [244]:
# Show table
postal_codes_df.head()

Unnamed: 0,district,subdistrict,entity,postal_code
2,Lefkoşa,LEFKOŞA,ABDİ ÇAVUŞ MAH,99010
3,Lefkoşa,LEFKOŞA,AKKAVUK MAH,99010
4,Lefkoşa,LEFKOŞA,ARABAHMET MAH,99010
5,Lefkoşa,LEFKOŞA,AYDEMET MAH,99010
6,Lefkoşa,LEFKOŞA,AYYILDIZ MAH,99010


## Quarter merging

It is assumed that an `entity` value containing 'MAH' or 'MAHALLESİ' is a quarter of the town that appears in the `subdistrict` (bucak) column.

Exceptions detected through manual inspection:
- `İPLİK PAZARI/K.EFENDİ` is a quarter of `Lefkoşa`
- `DENİZLİ` is considered a separate town from GEMİKONAĞI in Google Maps and has its own ID.


Further manual inspection might be necessary to detect more exceptions.

In [245]:
# Manual corrections prior to merging

# Add ' MAH' at the end of the row containing 'İPLİK PAZARI/K.EFENDİ'
postal_codes_df.loc[postal_codes_df['entity'] == 'İPLİK PAZARI/K.EFENDİ', 'entity'] = 'İPLİK PAZARI/K.EFENDİ MAH'

# Replace 'MAH' with 'KÖYÜ' in the row containing 'DENİZLİ MAH' to avoid being merged with 'GEMİKONAĞI MAH'
postal_codes_df.loc[postal_codes_df['entity'] == 'DENİZLİ MAH', 'entity'] = 'DENİZLİ KÖYÜ'

# Create a backup of the dataframe
postal_codes_df_backup = postal_codes_df.copy()

In [246]:
# Create a copy of the base dataframe
postal_codes_df = postal_codes_df_backup.copy()

# Drop all rows where the entity does not contain 'MAH'
postal_codes_df = postal_codes_df[postal_codes_df['entity'].str.contains('MAH')]

# Create a new column called 'quarter_name' and populate it with the entity name minus its last word
postal_codes_df['quarter_name'] = postal_codes_df['entity'].apply(lambda x: ' '.join(x.split(' ')[:-1]))

postal_codes_df

Unnamed: 0,district,subdistrict,entity,postal_code,quarter_name
2,Lefkoşa,LEFKOŞA,ABDİ ÇAVUŞ MAH,99010,ABDİ ÇAVUŞ
3,Lefkoşa,LEFKOŞA,AKKAVUK MAH,99010,AKKAVUK
4,Lefkoşa,LEFKOŞA,ARABAHMET MAH,99010,ARABAHMET
5,Lefkoşa,LEFKOŞA,AYDEMET MAH,99010,AYDEMET
6,Lefkoşa,LEFKOŞA,AYYILDIZ MAH,99010,AYYILDIZ
7,Lefkoşa,LEFKOŞA,ÇAĞLAYAN MAH,99010,ÇAĞLAYAN
8,Lefkoşa,LEFKOŞA,HAYDARPAŞA MAH,99010,HAYDARPAŞA
9,Lefkoşa,LEFKOŞA,GÖÇMENKÖY MAH,99010,GÖÇMENKÖY
10,Lefkoşa,LEFKOŞA,İBRAHİM PAŞA MAH,99010,İBRAHİM PAŞA
11,Lefkoşa,LEFKOŞA,İPLİK PAZARI/K.EFENDİ MAH,99010,İPLİK PAZARI/K.EFENDİ


In [247]:
# Create a new base dataframe to ensure repeatability
clean_postal_codes_df = postal_codes_df.copy()

# Create a new column indicating whether the quarter is included in the general census
clean_postal_codes_df['included_in_census'] = clean_postal_codes_df['quarter_name'].isin(census_df_trnc['town'])

# Set the index to the quarter name
clean_postal_codes_df.set_index('quarter_name', inplace=True)

clean_postal_codes_df

Unnamed: 0_level_0,district,subdistrict,entity,postal_code,included_in_census
quarter_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ABDİ ÇAVUŞ,Lefkoşa,LEFKOŞA,ABDİ ÇAVUŞ MAH,99010,True
AKKAVUK,Lefkoşa,LEFKOŞA,AKKAVUK MAH,99010,True
ARABAHMET,Lefkoşa,LEFKOŞA,ARABAHMET MAH,99010,True
AYDEMET,Lefkoşa,LEFKOŞA,AYDEMET MAH,99010,True
AYYILDIZ,Lefkoşa,LEFKOŞA,AYYILDIZ MAH,99010,True
ÇAĞLAYAN,Lefkoşa,LEFKOŞA,ÇAĞLAYAN MAH,99010,False
HAYDARPAŞA,Lefkoşa,LEFKOŞA,HAYDARPAŞA MAH,99010,True
GÖÇMENKÖY,Lefkoşa,LEFKOŞA,GÖÇMENKÖY MAH,99010,True
İBRAHİM PAŞA,Lefkoşa,LEFKOŞA,İBRAHİM PAŞA MAH,99010,False
İPLİK PAZARI/K.EFENDİ,Lefkoşa,LEFKOŞA,İPLİK PAZARI/K.EFENDİ MAH,99010,False


## Modify the general TRNC census dataframe adding quarter information from the postal codes dataframe

In [248]:
# Add a new column in the general TRNC census dataframe indicating the main city-subdistrict (bucak) to which the quarter belongs
def get_main_city(town: str,
                  district: str,
                  quarter_df = clean_postal_codes_df) -> str:
    
    # Filter by district first to avoid duplicates
    # District names must already be homogenized between both dataframes
    district_df = quarter_df.loc[quarter_df['district'] == district]

    # Get the main city to which it belongs, which appears in the 'subdistrict' column
    if town in district_df.index:
        main_city = district_df.loc[town, 'subdistrict']
        
    else:
        main_city = np.nan

    return main_city

census_df_trnc['main_city'] = census_df_trnc.apply(lambda x: get_main_city(x['town'], x['district']), axis=1)

In [249]:
# Show results
census_df_trnc.head()

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,main_city
0,ABDİ ÇAVUŞ,568,315,253,975,591,384,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
1,AKKAVUK,793,458,335,898,498,400,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
2,ARABAHMET,561,297,264,761,425,336,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
3,AYDEMET,2314,1147,1167,1550,765,785,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
4,AYYILDIZ,489,271,218,559,316,243,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA


### Quarter manual correction

Some quarters are either missing from the census or were not matched automatically. Thus, a manual correction is required.

The correspondences are established by manually comparing the two datasets: all the quarters in `clean_postal_codes_df` with a value of `False` for the `included_in_census` column are manually checked against the town names in the `census_df_trnc` census dataset to detect naming inconsistencies and assign a main city to them.
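
The comparison itself was done by hand. As a possible aid for that step, the sketch below uses `difflib.get_close_matches` from the standard library to suggest census names for unmatched quarters. This is not part of the original workflow: the two short lists are hypothetical samples and the 0.6 cutoff is an arbitrary choice, so every suggestion would still need manual confirmation.

```python
import difflib

# Hypothetical samples: unmatched postal-code quarters and census town names
unmatched_quarters = ['İBRAHİM PAŞA', 'İPLİK PAZARI/K.EFENDİ']
census_towns = ['İBRAHİMPAŞA', 'İPLİKPAZARI', 'HAMİTKÖY', 'HASPOLAT']

# For each unmatched quarter, list the most similar census names
suggestions = {
    quarter: difflib.get_close_matches(quarter, census_towns, n=3, cutoff=0.6)
    for quarter in unmatched_quarters
}

for quarter, candidates in suggestions.items():
    print(f'{quarter} -> {candidates}')
```

Names that differ only by spacing or an extra suffix tend to score well, while genuinely different names return an empty list.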

In [250]:
# Create a dictionary with the quarters to add to each main city
quarter_corrections = {
    # Main city : quarters to add (namings as shown in the census_df_trnc dataframe)
    # Nicosia
    'LEFKOŞA' : ['ÇAĞLAYAN',
                 'İBRAHİMPAŞA',
                 'İPLİKPAZARI',
                 'KÖŞKLÜÇİFTLİK',
                 'MAHMUTPAŞA'],
    # Kioneli (Gönyeli)
    'GÖNYELİ' : ['GÖNYELİ',
                 'YENİKENT'],
    # Kyrenia (Girne)
    # This instance may be confused with 'ZEYTİNLİK KÖY'. 
    # Differences are to be determined and might be reviewed in the future.
    'GİRNE' : ['ZEYTİNLİK KESİM'], 
    # Alsancak
    'ALSANCAK' : ['YAYLA'],
    # Lapithos (Lapta)
    'LAPTA' : ['SAKARYA',
               'YAVUZ'],
    # Famagusta (Gazimağusa)
    'GAZİMAĞUSA' : ['CANBOLAT',
                    'SURİÇİ'],
    # Tríkomo (İskele)
    'İSKELE' : ['CEVİZLİ'],
    # Rizokarpaso (Dipkarpaz)
    'DİPKARPAZ' : ['POLAT PAŞA']                
}

In [251]:
# Add the quarters to the main cities
for main_city, quarters in quarter_corrections.items():
    for quarter in quarters:
        census_df_trnc.loc[census_df_trnc['town'] == quarter, 'main_city'] = main_city

# Some quarter names are shared between different main cities and towns. 
# However, the above code assigns all quarters with the same name to the same main city.
# To account for this, the following manual corrections are needed:
quarter_extra_corrections = [
    # Settlement, Municipality, Main city value
    ('YAYLA', 'Güzelyurt', np.nan),
    ('ÇAĞLAYAN', 'Alsancak', 'ALSANCAK'),
    ('SAKARYA', 'Gazimağusa', 'GAZİMAĞUSA')
]

for settlement, municipality, main_city_value in quarter_extra_corrections:
    census_df_trnc.loc[(census_df_trnc['town'] == settlement) & (census_df_trnc['municipality'] == municipality), 'main_city'] = main_city_value    

census_df_trnc

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,main_city
0,ABDİ ÇAVUŞ,568,315,253,975,591,384,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
1,AKKAVUK,793,458,335,898,498,400,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
2,ARABAHMET,561,297,264,761,425,336,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
3,AYDEMET,2314,1147,1167,1550,765,785,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
4,AYYILDIZ,489,271,218,559,316,243,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
5,ÇAĞLAYAN,1307,667,640,1413,744,669,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
6,GÖÇMENKÖY,3003,1551,1452,2946,1526,1420,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA
7,HAMİTKÖY,5338,2773,2565,2898,1567,1331,Lefkoşa,Lefkoşa,Lefkoşa,
8,HASPOLAT,4204,2385,1819,3380,2168,1212,Lefkoşa,Lefkoşa,Lefkoşa,
9,HAYDARPAŞA,155,80,75,320,186,134,Lefkoşa,Lefkoşa,Lefkoşa,LEFKOŞA


Certain locations are divided into "upper" ("Yukarı") and "lower" ("Aşağı") parts. To help with coordinate retrieval in future steps, those instances will be assigned a main city, so that they are merged into a single entity recognized by the Geocoding API.

Divided locations:
- Girne
- Dikmen

In [252]:
divided_locations = ['GİRNE', 'DİKMEN']
location_prefixes = ['YUKARI', 'AŞAĞI']

for location in divided_locations:
    for prefix in location_prefixes:
        census_df_trnc.loc[census_df_trnc['town'] == f'{prefix} {location}', 'main_city'] = location

Additionally, the city of Trikomo (İskele) has an entity in the census, but it has no `main_city` value. To improve the quarter merging process, it is set to the city name.

In [253]:
# Add İSKELE as a main city to its row in the census dataframe
census_df_trnc.loc[census_df_trnc['town'] == 'İSKELE', 'main_city'] = 'İSKELE'

## Merge all quarters to their main cities

To do so, the values for all the rows that share a `main_city` value in `census_df_trnc` are summed up, and a new row is created for the city.

There are two possibilities for these values:
1) There is no instance in the `town` column with the name of the city in the census. In this case, all the rows are summed up and the resulting row is added to the dataframe.
2) There is an instance with that name in the census dataset, further divided in two cases:
   - There is only one instance that has that name as its `main_city`. In this case, the process is skipped.
   - There is more than one instance that has that name as its `main_city`. In this case, the rows are summed up and the result is added to the dataframe.

All these cases are handled by the `merge_quarter_data` function.
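
The core idea can be sketched on a toy dataframe before reading the full function. The names and figures below are hypothetical, and the sketch covers only the summing step, not the handling of every administrative column.

```python
import pandas as pd

# Toy data: two quarters of a hypothetical city plus one independent town
toy = pd.DataFrame({
    'town': ['QUARTER A', 'QUARTER B', 'TOWN C'],
    'population': [100, 250, 40],
    'district': ['District X', 'District X', 'District X'],
    'main_city': ['CITY', 'CITY', None],
})

# Split the rows that belong to the main city from the rest of the census
is_quarter = toy['main_city'].isin(['CITY'])
quarters = toy[is_quarter]
others = toy[~is_quarter]

# Build one merged row: summed population plus the shared administrative data
city_row = pd.DataFrame({
    'town': ['CITY'],
    'population': [quarters['population'].sum()],
    'district': [quarters['district'].iloc[0]],
    'main_city': ['CITY'],
})

# Replace the quarter rows with the merged city row
merged = pd.concat([others, city_row], ignore_index=True)
print(merged)
```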

In [254]:
# Get unique values for the main cities without NaN
main_cities = census_df_trnc['main_city'].dropna().unique().tolist()
print(f'Main cities: {main_cities}')

# Create an auxiliary dataframe to store the results
aux_census_df_trnc = census_df_trnc.copy()

Main cities: ['LEFKOŞA', 'GÖNYELİ', 'DEĞİRMENLİK', 'GAZİMAĞUSA', 'TATLISU', 'GİRNE', 'ALSANCAK', 'DİKMEN', 'LAPTA', 'GÜZELYURT', 'GEMİKONAĞI', 'İSKELE', 'DİPKARPAZ']


### Merge main cities that have no instance in the census

In [255]:
def merge_quarter_data(
        cities: list, 
        census_df: pd.DataFrame = aux_census_df_trnc) -> pd.DataFrame:
    """
    Computes the total population for each main city and adds administrative information 
    based on a list of main cities. It also deletes the information for the quarters that
    belong to the main cities.

    Parameters
    ----------
    cities : list
        List of main cities to be processed.
    census_df : pd.DataFrame
        Dataframe containing the census data. By default,
        the auxiliary dataframe is used to avoid overwriting
        the original data.

    Returns
    -------
    census_df: pd.DataFrame
        A modified version of the original dataframe with the
        population data for each main city. 
    """

    for city in cities:
        # Extract city quarters
        city_filter = census_df[census_df['main_city'] == city]

        # Drop all rows for which `main_city` is `city`
        census_df = census_df[census_df['main_city'] != city]

        # Store administrative information
        city_admin_info = city_filter[['district', 'subdistrict', 'municipality', 'main_city']].copy()
        # Use this method to confirm that data belong to the same district and municipality
        # If two rows are returned, the data is not consistent
        city_admin_info.drop_duplicates(inplace=True) 

        # Sum population rows (numeric columns only)
        city_population = (city_filter
                        .groupby('main_city')
                        .sum(numeric_only=True)
                        .reset_index())

        # Add administrative information
        city_info = pd.merge(city_population, city_admin_info, on='main_city')

        # Extract town name from 'main_city' column
        city_info['town'] = city_info['main_city']

        # Reorder columns
        order = [
            'town', 
            'population', 'male_population', 'female_population', 
            'population_2006', 'male_population_2006', 'female_population_2006',
            'district', 'subdistrict', 'municipality', 'main_city'
            ]
        city_info = city_info[order]

        # Add city to the census
        census_df = pd.concat([census_df, city_info])

    # Drop 'main_city' column
    census_df = census_df.drop(columns='main_city')

    return census_df

In [256]:
# Apply quarter merging function
aux_census_df_trnc = merge_quarter_data(main_cities, aux_census_df_trnc)

# Visual check prior to overwriting the original dataframe
aux_census_df_trnc.tail(12)

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality
0,GÖNYELİ,17045,8966,8079,12186,6519,5667,Lefkoşa,Lefkoşa,Gönyeli
0,DEĞİRMENLİK,3284,1778,1506,3128,1685,1443,Lefkoşa,Değirmenlik,Değimenlik
0,GAZİMAĞUSA,37642,20301,17341,33001,17897,15104,Gazimağusa,Gazimağusa,Gazimağusa
0,TATLISU,1120,579,541,903,461,442,Gazimağusa,Geçitkale,Tatlısu
0,GİRNE,21319,11403,9916,18744,10594,8150,Girne,Girne,Girne
0,ALSANCAK,5595,2948,2647,4638,2577,2061,Girne,Girne,Alsancak
0,DİKMEN,3969,2131,1838,2605,1464,1141,Girne,Girne,Dikmen
0,LAPTA,5748,2959,2789,5658,3151,2507,Girne,Girne,Lapta
0,GÜZELYURT,7251,3619,3632,7627,3885,3742,Güzelyurt,Güzelyurt,Güzelyurt
0,GEMİKONAĞI,2075,1318,757,1498,854,644,Güzelyurt,Lefke,Lefke


In [257]:
# Overwrite the original dataframe
census_df_trnc = (aux_census_df_trnc
                  .sort_index()
                  .reset_index(drop=True)
                  .copy())

census_df_trnc.head()

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality
0,DİPKARPAZ,2016,1007,1009,1935,968,967,İskele,Yeni,Dipkarpaz
1,GEMİKONAĞI,2075,1318,757,1498,854,644,Güzelyurt,Lefke,Lefke
2,GÜZELYURT,7251,3619,3632,7627,3885,3742,Güzelyurt,Güzelyurt,Güzelyurt
3,LAPTA,5748,2959,2789,5658,3151,2507,Girne,Girne,Lapta
4,DİKMEN,3969,2131,1838,2605,1464,1141,Girne,Girne,Dikmen


## Save checkpoint

In [258]:
# Save a checkpoint of the dataframe to a csv file
version = 2

census_df_trnc.to_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv', index=False)

In [259]:
# Load the checkpoint
census_df_trnc = pd.read_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv')

## Data cleaning prior to adding new sources

This mainly involves lowercasing strings, since TRNC data sources use capital letters for town naming.

### Lowercase town names

We cannot apply a simple lowercase function, since Python's `str.lower()` turns the dotless capital 'I' into 'i' instead of 'ı'. Consequently, we first replace 'I' with 'ı' and then lowercase the rest of each word, keeping its initial capital.
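
The sketch below shows the default casing behaviour for both Turkish 'i' letters and one possible way to handle them. `turkish_lower` is a hypothetical helper, not the function used in this notebook, and it also maps the dotted capital 'İ', which goes beyond the replacement described above.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for the dotted and dotless 'i'."""
    # Map both capitals explicitly before calling lower()
    return text.replace('\u0130', 'i').replace('I', '\u0131').lower()

# Default behaviour of str.lower() for the two capitals
print('I'.lower() == 'i')               # dotless capital becomes a dotted 'i'
print('\u0130'.lower() == 'i\u0307')    # 'İ' becomes 'i' + combining dot above

# Hypothetical helper applied to uppercase names
print(turkish_lower('KIRIKKALE'))
print(turkish_lower('G\u0130RNE'))
```

Swapping such a helper into the workflow would change the resulting town strings, so any later step that looks towns up by name would need to be updated accordingly.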


In [260]:
def lowercase_turkish_town_name(town: str) -> str:
    """Lowercases a Turkish town name, keeping the initial letter of each word.

    Parameters
    ----------
    town : str
        The town name to be lowercased.

    Returns
    -------
    str
        The lowercased town name.

    """

    # Split the town name into words
    words = town.split(' ')
    
    processed_words = []
    for word in words:
        # Separate word parts
        initial = word[0]
        body = word[1:]

        # Replace 'I' with 'ı' in the body, then lowercase it
        body = body.replace('I', 'ı').lower()

        # Join the word parts back together
        word = initial + body

        processed_words.append(word)
        

    # Join the words back into a string
    town = ' '.join(processed_words)

    return town

In [261]:
# Apply lowercasing function
census_df_trnc['town'] = census_df_trnc['town'].apply(lowercase_turkish_town_name)

## Save checkpoint

In [262]:
# Set version
version = 3

In [263]:
# Save a checkpoint of the dataframe to a csv file
census_df_trnc.to_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv', index=False)

In [264]:
# Load the checkpoint
census_df_trnc = pd.read_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv')

# 6) Google Maps API for northern Cyprus

### Important notes

Data for settlements in northern Cyprus is heavily unreliable: the coordinates offered point to the center of the administrative unit and not to the settlement itself. Although the coordinates can be obtained through the OpenStreetMap API, the Google Maps API is still needed to get consistent Greek and Turkish namings for towns. As such, coordinates will be generated here and overwritten by OpenStreetMap data later on.

In [265]:
# Extract coordinates for towns in Cyprus from Geocoding API
def extract_coordinates(
        town: str,
        district: str = None,
        language: str = None,
        api_key: str = api_key, 
        boundaries: list = [34.51, 32.17, 35.73, 34.61]) -> tuple:
    """
    Extract coordinates for a town in Cyprus from Google Maps Geocoding API.

    The boundaries argument is sent to the API as the search area, and results
    located outside it are discarded.

    Parameters
    ----------
    town : str
        The town name.
    district : str
        The district name.
    api_key : str
        The Google Maps API key.
    boundaries : list
        The boundaries of the search area. It is a list of four floats that represent the
        boundaries of the search area. The order of the floats is as follows:
        [southwest_lat, southwest_lon, northeast_lat, northeast_lon].
        By default, the boundaries are set to the approximate boundaries of Cyprus island.

    Returns
    -------
    lat : float
        The latitude of the town.
    lon : float
        The longitude of the town.
    gm_id : str
        The Google Maps ID of the town.
    """
        
    # Set search term
    query = town

    # Add district to search term
    if district is not None:
        query += ' ' + district
        print(query)

    # Set Geocoding API URL
    url = 'https://maps.googleapis.com/maps/api/geocode/json?address=' + query + '&key=' + api_key

    # Add boundaries to search
    if boundaries:
        url += '&bounds=' + str(boundaries[0]) + ',' + str(boundaries[1]) + '|' + str(boundaries[2]) + ',' + str(boundaries[3])
    
    # Add language
    if language is not None:
        url += '&language=' + language

    ### Request execution ###

    # Extract coordinates
    response = requests.get(url)
    data = json.loads(response.text)
    
    # Extract coordinates and Google Maps ID
    try:
        
        # If there are no results at all, return NaNs
        if len(data['results']) == 0:
            print('Unable to extract coordinates for ' + town + ': no results found.')
            return np.nan, np.nan, np.nan

        # Get the first results that contains 'locality' in the types list and
        # is within the boundaries
        valid_i = None
        for i, result in enumerate(data['results']):
            # Check that the result contains 'locality'
            if 'locality' in result['types']:
                # Check that the result is within the boundaries
                lat = data['results'][i]['geometry']['location']['lat']
                lon = data['results'][i]['geometry']['location']['lng']
                if lat < boundaries[0] or lat > boundaries[2] or lon < boundaries[1] or lon > boundaries[3]:
                    continue

                # If all checks are valid, set the index of the result
                valid_i = i
                break

            # If there are no results, return NaNs
            elif i == len(data['results']) - 1:
                print('Unable to extract coordinates for ' + town + ': no locality results found.')
                return np.nan, np.nan, np.nan
            
        # If there are no valid results, return NaNs
        if valid_i is None:
            print('Unable to extract coordinates for ' + town + ': no valid results found.')
            return np.nan, np.nan, np.nan

        # Extract coordinates and Google Maps ID
        lat = data['results'][valid_i]['geometry']['location']['lat']
        lon = data['results'][valid_i]['geometry']['location']['lng']
        gm_id = data['results'][valid_i]['place_id']

    except (KeyError, ValueError):
        print('Unable to extract coordinates for ' + town + '. General error')
        lat = np.nan
        lon = np.nan
        gm_id = np.nan

    return lat, lon, gm_id
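
The function above builds the request URL by string concatenation. As an alternative sketch, the query string can be assembled with `urllib.parse.urlencode`, which percent-encodes spaces, non-ASCII letters and the `|` separator. `build_geocoding_url` is a hypothetical helper and `'YOUR_API_KEY'` is a placeholder; the `address`, `key`, `bounds` and `language` parameter names are those of the Geocoding API.

```python
from urllib.parse import urlencode

def build_geocoding_url(address, api_key, boundaries=None, language=None):
    """Build a Geocoding API request URL with percent-encoded parameters."""
    params = {'address': address, 'key': api_key}

    if boundaries:
        sw_lat, sw_lon, ne_lat, ne_lon = boundaries
        # The API expects the box as 'southwest|northeast', each as 'lat,lng'
        params['bounds'] = f'{sw_lat},{sw_lon}|{ne_lat},{ne_lon}'

    if language is not None:
        params['language'] = language

    return 'https://maps.googleapis.com/maps/api/geocode/json?' + urlencode(params)

# Example with the bounding box used for northern Cyprus in this notebook
url = build_geocoding_url('Lapta Kıbrıs', 'YOUR_API_KEY',
                          boundaries=[34.9, 32.5, 35.8, 34.6], language='el')
print(url)
```

Building the URL separately also makes it possible to inspect the exact request without calling the API.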

In [266]:
# Generate coordinates if requested
generate_coordinates = True

if generate_coordinates:
    # Extract coordinates for towns in Cyprus from Geocoding API
    # Takes ~ 1 minute
    # Latitude, longitude, Google Maps ID
    census_df_trnc['lat'], census_df_trnc['lon'], census_df_trnc['gm_id'] = zip(*census_df_trnc.apply(lambda x: extract_coordinates(x['town'], 
                                                        district = 'Kıbrıs',
                                                        boundaries=[34.9, 32.5, 35.8, 34.6]), axis=1))

    # Save coordinates with retrieval date
    census_coordinates = census_df_trnc[['town', 'district', 'lat', 'lon', 'gm_id']]
    census_coordinates.to_csv(f'sources/Geocoding API/geocoding_coordinates_TRNC_{date.today().strftime("%Y-%m-%d")}.csv', index=False)

else:
    # Select the latest coordinates file
    list_of_files = glob.glob('sources/Geocoding API/geocoding_coordinates_TRNC_*.csv')
    latest_file = max(list_of_files, key=os.path.getctime)
    
    # Load coordinates from file
    census_coordinates = pd.read_csv(latest_file)

    # Add coordinates to census dataframe
    census_df_trnc = census_df_trnc.merge(census_coordinates, on=['town', 'district'], how='left')

census_df_trnc.head()

Di̇pkarpaz Kıbrıs
Gemi̇konağı Kıbrıs
Güzelyurt Kıbrıs
Lapta Kıbrıs
Di̇kmen Kıbrıs
Alsancak Kıbrıs
Gi̇rne Kıbrıs
Unable to extract coordinates for Gi̇rne: no locality results found.
Tatlısu Kıbrıs
Gazi̇mağusa Kıbrıs
Deği̇rmenli̇k Kıbrıs
Gönyeli̇ Kıbrıs
Lefkoşa Kıbrıs
İskele Kıbrıs
Hami̇tköy Kıbrıs
Haspolat Kıbrıs
Akıncılar Kıbrıs
Alayköy Kıbrıs
Türkeli̇ Kıbrıs
Yılmazköy Kıbrıs
Kanlıköy Kıbrıs
Balıkesi̇r Kıbrıs
Beyköy Kıbrıs
Ci̇hangi̇r Kıbrıs
Çukurova Kıbrıs
Demi̇rhan Kıbrıs
Di̇lekkaya Kıbrıs
Düzova Kıbrıs
Erdemli̇ Kıbrıs
Gazi̇köy Kıbrıs
Gökhan Kıbrıs
Kalavaç Kıbrıs
Kırıkkale Kıbrıs
Unable to extract coordinates for Kırıkkale: no locality results found.
Kırklar Kıbrıs
Meri̇ç Kıbrıs
Mi̇nareli̇köy Kıbrıs
Taşocakları Kıbrıs
Unable to extract coordinates for Taşocakları: no locality results found.
Yeni̇ceköy Kıbrıs
Yi̇ği̇tler Kıbrıs
Maraş Kıbrıs
Unable to extract coordinates for Maraş: no locality results found.
Mutluyaka Kıbrıs
Tuzla Kıbrıs
Yeni̇ Boğazi̇çi̇ Kıbrıs
Akova Kıbrıs
Alani̇çi̇ Kıb

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id
0,Di̇pkarpaz,2016,1007,1009,1935,968,967,İskele,Yeni,Dipkarpaz,35.617682,34.408731,ChIJFdx3t2tn3xQRInHdNFtCOCg
1,Gemi̇konağı,2075,1318,757,1498,854,644,Güzelyurt,Lefke,Lefke,35.142762,32.81127,ChIJw1gDZOdY5xQRzgnyO-yf0b4
2,Güzelyurt,7251,3619,3632,7627,3885,3742,Güzelyurt,Güzelyurt,Güzelyurt,35.212317,32.977567,ChIJfz7P_1H53RQRmbROZQ0T7bg
3,Lapta,5748,2959,2789,5658,3151,2507,Girne,Girne,Lapta,35.336408,33.163266,ChIJSfTuTocM3hQRNbYv5mkqRUM
4,Di̇kmen,3969,2131,1838,2605,1464,1141,Girne,Girne,Dikmen,35.266918,33.3277,ChIJf0s8-cAT3hQRY6tDs7XxiJg


In [267]:
## Plot and inspect results
# Show the retrieval results to check for errors
# Plot all towns controlled by the TRNC
map = folium.Map(location=[35.1264, 33.4299], zoom_start=9)

for i in range(len(census_df_trnc)):
    try:
        folium.CircleMarker(
            location=[census_df_trnc['lat'][i], census_df_trnc['lon'][i]],
            radius=5,
            color='red',
            fill=True,
            fill_color='red',
            fill_opacity=0.7,
            parse_html=False,
            # Add text when hovering over the marker
            tooltip=f"{census_df_trnc['town'][i]}",
            ).add_to(map)
    except ValueError:
        pass
    
map

## Error correction

After a visual inspection of the results, the following errors were found:
- Several towns were not found in the Geocoding API. Possible reasons include:
    - Several possible results for a single location were returned.
    - No results were returned. Specifically, the record named `Taşocakları` does not exist and seems to refer to a quarry. Further research is required to determine the nature of this instance.
- The town of Makrásika (`İncirli`), near Famagusta, receives the coordinates of a homonymous quarter in Lefkosia.
- There is missing information for the town of `Aşağı Karaman`. This entity is kept separated from the town of `Karaman` until this case is inspected more closely.
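
The errors above were found by inspecting the map. Part of that check could be automated, as in the sketch below, which flags rows without coordinates and rows that share a Google Maps ID. It is not part of the original workflow and runs on hypothetical data. A shared ID only reveals a homonym mix-up when both places are present in the dataframe, so it complements the visual check rather than replacing it.

```python
import numpy as np
import pandas as pd

# Hypothetical geocoding results
results = pd.DataFrame({
    'town': ['Town A', 'Town B', 'Town C', 'Town D'],
    'lat': [35.30, np.nan, 35.19, 35.19],
    'lon': [33.20, np.nan, 33.36, 33.36],
    'gm_id': ['id-1', np.nan, 'id-2', 'id-2'],
})

# Towns for which no coordinates were retrieved
missing = results[results['lat'].isna()]

# Towns sharing a place ID with another row (possible homonym mix-ups)
with_id = results.dropna(subset=['gm_id'])
duplicated = with_id[with_id.duplicated(subset='gm_id', keep=False)]

print(missing['town'].tolist())
print(duplicated['town'].tolist())
```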

### Correct missing coordinates

In [268]:
# Show towns with missing coordinates
census_df_trnc[census_df_trnc['lat'].isna()]

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id
6,Gi̇rne,21319,11403,9916,18744,10594,8150,Girne,Girne,Girne,,,
31,Kırıkkale,390,184,206,398,189,209,Lefkoşa,Değirmenlik,Değimenlik,,,
35,Taşocakları,0,0,0,128,91,37,Lefkoşa,Değirmenlik,Değimenlik,,,
38,Maraş,226,125,101,503,317,186,Gazimağusa,Gazimağusa,Gazimağusa,,,
55,Çayönü,660,341,319,652,349,303,Gazimağusa,Akdoğan,Beyarmudu,,,
80,Küçükerenköy,339,168,171,257,187,70,Gazimağusa,Geçitkale,Tatlısu,,,
81,Aşağı Karaman,675,346,329,565,312,253,Girne,Girne,Girne,,,
83,Doğanköy,868,429,439,457,245,212,Girne,Girne,Girne,,,
105,Esentepe,1754,920,834,1575,949,626,Girne,Girne,Esentepe,,,
146,Yedi̇dalga,669,329,340,767,397,370,Güzelyurt,Lefke,Lefke,,,


In [269]:
# Prepare manual corrections 
coordinates_manual_mapping = {
    # Town : [lat, lon, gm_id]
    # Unable to disambiguate results
    'Gi̇rne' : [35.299194, 33.2363246, 'ChIJKS2WH74N3hQRjxHlC6st0tM'], 
    
    # Unable to retrieve coordinates automatically
    'Maraş' : [35.105872, 33.9554907, 'ChIJFbxuNW_I3xQRXL-W8cUUcHc'],
    'Küçükerenköy': [35.3630139, 33.6682723, 'ChIJASOqhDZO3hQRLBc-qDsiiEI'],
    'Yedi̇dalga' : [35.144474, 32.8056285, 'ChIJcb1P3tJY5xQRmfZXUDXRy0s'],
    'Boğaztepe' : [35.3169997, 33.9457337, 'ChIJpZeEvrmv3xQRkQn98z8idbo'],
    'Kantara' : [35.3871012, 33.8990248, 'ChIJjZyEuU2p3xQROD6uAAxEHCg'],
    
    # Difficulty accessing Geocoding API instance: unable to find using references to Cyprus
    'Doğanköy' : [35.3241561, 33.3331825, 'ChIJ2_cKfkwT3hQRPMWDanLRPCQ'],

    # There is no entity in Google Maps for Esentepe (Agios Amvrosios) and Çayönü
    # Approximate coordinates from Google Maps are used
    'Esentepe' : [35.3403306, 33.5817749, np.nan],
    'Çayönü' : [35.0963631, 33.7936146, np.nan],

    # Has the same name as a quarter in Nicosia
    'İnci̇rli̇' : [35.0795529, 33.7685332, 'ChIJwaHU1-Iy3hQRbQRoPDGgLv4'],
    
    # Location is set to Boğaziçi due to name similarity
    'Boğaz' : [35.316774, 33.9540969, 'ChIJUasau5ev3xQRzJkaQJUAR-A'],

    # Location is set to Yeni Boğaziçi due to name similarity
    'Boğazi̇çi̇' : [35.272867, 33.8355465, 'ChIJZSUjjWyz3xQRNs6Qsmp39nA'],

    # Has the same Turkish name as a former Turkish-Cypriot village in the ROC-controlled area
    'Yıldırım' : [35.2347575, 33.8020459, 'ChIJVZiJvjtL3hQRyAHzEUei13I'],
    
    # Unknown reason
    'Kırıkkale' : [35.0770318, 33.5895956, 'ChIJQ7t3sDwl3hQRa0Wi8RX92a8'],
    'Taşlıca' : [35.4643867, 34.2171472, 'ChIJKVz5NrZz3xQRC0ySgw6qt50']
    }

# Apply manual corrections
for town in coordinates_manual_mapping.keys():
    census_df_trnc.loc[census_df_trnc['town'] == town, ['lat', 'lon', 'gm_id']] = coordinates_manual_mapping[town]

In [270]:
# Show results
census_df_trnc.head()

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id
0,Di̇pkarpaz,2016,1007,1009,1935,968,967,İskele,Yeni,Dipkarpaz,35.617682,34.408731,ChIJFdx3t2tn3xQRInHdNFtCOCg
1,Gemi̇konağı,2075,1318,757,1498,854,644,Güzelyurt,Lefke,Lefke,35.142762,32.81127,ChIJw1gDZOdY5xQRzgnyO-yf0b4
2,Güzelyurt,7251,3619,3632,7627,3885,3742,Güzelyurt,Güzelyurt,Güzelyurt,35.212317,32.977567,ChIJfz7P_1H53RQRmbROZQ0T7bg
3,Lapta,5748,2959,2789,5658,3151,2507,Girne,Girne,Lapta,35.336408,33.163266,ChIJSfTuTocM3hQRNbYv5mkqRUM
4,Di̇kmen,3969,2131,1838,2605,1464,1141,Girne,Girne,Dikmen,35.266918,33.3277,ChIJf0s8-cAT3hQRY6tDs7XxiJg


The exclave of Kokkina (`Erenköy`) is not included in the census, but falls within the database's scope. [Used as a military camp](https://en.wikipedia.org/wiki/Kokkina), it has a population of 0. Its information will be added manually.

In [271]:
# Create a dictionary for Kokkina
kokkina_info = {
    'town' : 'Erenköy', # Use the Turkish toponym for consistency purposes, it will be amended later
    'population' : 0,
    'male_population' : 0,
    'female_population' : 0,
    'population_2006' : 0,
    'male_population_2006' : 0,
    'female_population_2006' : 0,
    'district' : 'Lefke',
    'subdistrict' : 'Lefke',
    'municipality' : 'Lefke',
    'lat' : 35.179457,
    'lon' : 32.610868,
    'gm_id' : 'ChIJ_Sn_GWVh5xQRezW4HeaNPhE'
}

# Add Kokkina to the dataframe
if 'Erenköy' not in census_df_trnc['town'].values:
    census_df_trnc = pd.concat([census_df_trnc, pd.DataFrame([kokkina_info])], ignore_index=True)

## Results inspection after manual corrections

In [272]:
## Plot and inspect results
map = folium.Map(location=[35.1264, 33.4299], zoom_start=9)

for i in range(len(census_df_trnc)):
    try:
        folium.CircleMarker(
            location=[census_df_trnc['lat'][i], census_df_trnc['lon'][i]],
            radius=5,
            color='red',
            fill=True,
            fill_color='red',
            fill_opacity=0.7,
            parse_html=False,
            # Add text when hovering over the marker
            tooltip=f"{census_df_trnc['town'][i]}",
            ).add_to(map)
    except ValueError:
        pass
    
map

## Save checkpoint

In [273]:
# Set version
version = 4

In [274]:
# Save a checkpoint of the dataframe to a csv file
census_df_trnc.to_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv', index=False)

In [275]:
# Load the checkpoint
census_df_trnc = pd.read_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv')

# 7) Add Greek names to settlements not controlled by the ROC

Alongside the Turkified toponyms, Google Maps also records the Greek names of settlements in northern Cyprus. They are returned when the request language is set to Greek.

## Greek names

In [276]:
# Define a function to extract the Greek name from a Google Maps entity
def extract_greek_name(gm_id):
    """Given a Google Maps ID, extract the Greek name of the entity.
    Used to retrieve the Greek name of a town with a Turkified toponym.
    
    Parameters
    ----------
    gm_id : str
        The Google Maps ID of the entity.
        
    Returns
    -------
    greek_name : str
        The Greek name of the entity."""

    # Set Geocoding API URL
    url = 'https://maps.googleapis.com/maps/api/geocode/json?language=EL&key=' + api_key + '&place_id=' + gm_id

    # Retrieve the entity
    response = requests.get(url)
    entity = json.loads(response.text)

    # Extract the Greek name
    greek_name = entity['results'][0]['address_components'][0]['long_name']
    print(f'Greek name for {gm_id} is {greek_name}')
    
    return greek_name
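
The lookup above indexes straight into `results[0]`, which raises an error if the API returns no match. Below is a minimal sketch of a more defensive parser, assuming the usual Geocoding response layout with a top-level `status` field. The helper name `first_long_name` and the sample payloads are hypothetical.

```python
import math

def first_long_name(entity):
    """Return the first address component's long name, or NaN if absent."""
    if entity.get("status") != "OK" or not entity.get("results"):
        return float("nan")
    components = entity["results"][0].get("address_components", [])
    if not components:
        return float("nan")
    return components[0].get("long_name", float("nan"))

# Hypothetical payloads shaped like Geocoding API responses
ok_payload = {"status": "OK", "results": [{"address_components": [{"long_name": "Μόρφου"}]}]}
empty_payload = {"status": "ZERO_RESULTS", "results": []}

print(first_long_name(ok_payload))
print(first_long_name(empty_payload))
```

Returning NaN instead of raising keeps a long `apply` run from stopping at the first settlement without a match.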

In [277]:
generate_names = False

if generate_names:
    # Apply the function to the dataframe
    census_df_trnc['greek_name'] = census_df_trnc['gm_id'].apply(lambda x: extract_greek_name(x) if pd.notna(x) else np.nan)

    # Extract a table with the town names, the Google IDs and the Greek names
    greek_names_df = census_df_trnc[['town', 'gm_id', 'greek_name']]
    greek_names_df.to_csv(f'sources/Geocoding API/geocoding_greek_toponyms__{date.today().strftime("%Y-%m-%d")}.csv', index=False)

else:
    # Select the latest version of the Greek names table
    list_of_files = glob.glob('sources/Geocoding API/geocoding_greek_toponyms_*.csv')
    latest_file = max(list_of_files, key=os.path.getctime)

    # Load the Greek names from file
    greek_names_df = pd.read_csv(latest_file)

    # Add the Greek names table
    census_df_trnc = census_df_trnc.merge(greek_names_df, on=['town', 'gm_id'], how='left')

# Show the dataframe
census_df_trnc.head()

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id,greek_name
0,Di̇pkarpaz,2016,1007,1009,1935,968,967,İskele,Yeni,Dipkarpaz,35.617682,34.408731,ChIJFdx3t2tn3xQRInHdNFtCOCg,Ριζοκάρπασο
1,Gemi̇konağı,2075,1318,757,1498,854,644,Güzelyurt,Lefke,Lefke,35.142762,32.81127,ChIJw1gDZOdY5xQRzgnyO-yf0b4,Καραβοστάσι
2,Güzelyurt,7251,3619,3632,7627,3885,3742,Güzelyurt,Güzelyurt,Güzelyurt,35.212317,32.977567,ChIJfz7P_1H53RQRmbROZQ0T7bg,Μόρφου
3,Lapta,5748,2959,2789,5658,3151,2507,Girne,Girne,Lapta,35.336408,33.163266,ChIJSfTuTocM3hQRNbYv5mkqRUM,Λάπηθος
4,Di̇kmen,3969,2131,1838,2605,1464,1141,Girne,Girne,Dikmen,35.266918,33.3277,ChIJf0s8-cAT3hQRY6tDs7XxiJg,Δίκωμο
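
The cached branch above picks the newest file by creation time, and `max` raises a bare `ValueError` when no file matches the pattern. Below is a sketch of an alternative that relies on the ISO date in the file name (which sorts chronologically) and fails with a clearer message. `latest_dated_file` is a hypothetical helper, and the files are created in a temporary directory only for the demonstration.

```python
import glob
import os
import tempfile

def latest_dated_file(pattern):
    """Return the matching path whose file name sorts last."""
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"No cached file matches {pattern!r}")
    return max(matches, key=os.path.basename)

# Demonstration with two hypothetical cache files
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("2023-01-15", "2023-03-02"):
        open(os.path.join(tmp, f"geocoding_greek_toponyms__{stamp}.csv"), "w").close()
    pattern = os.path.join(tmp, "geocoding_greek_toponyms_*.csv")
    newest = os.path.basename(latest_dated_file(pattern))

print(newest)
```

Sorting by name does not depend on file-system timestamps, which can change when a repository is cloned or copied.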


In [278]:
census_df_trnc.tail()

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id,greek_name
189,Si̇pahi̇,614,295,319,659,322,337,İskele,Yeni,Yeni,35.533153,34.236602,ChIJRe5YJrdt3xQRNMiDadjI280,Αγία Τριάς
190,Taşlıca,80,40,40,99,53,46,İskele,Yeni,Yeni,35.464387,34.217147,ChIJKVz5NrZz3xQRC0ySgw6qt50,Νέτα
191,Yeşi̇lköy,799,411,388,777,386,391,İskele,Yeni,Yeni,35.503141,34.16432,ChIJ_bg3-89y3xQRWY8WMwdxxjA,Άγιος Ανδρόνικος
192,Zi̇yamet,739,387,352,715,365,350,İskele,Yeni,Yeni,35.46234,34.125375,ChIJv1L4zExz3xQRlKeaoS9biFo,Λεωνάρισο
193,Erenköy,0,0,0,0,0,0,Lefke,Lefke,Lefke,35.179457,32.610868,ChIJ_Sn_GWVh5xQRezW4HeaNPhE,Κόκκινα


## Manual corrections

Certain entities require manual corrections for several reasons.

* `Esentepe` has no entity in Google Maps. Its Greek name is `Άγιος Αμβρόσιος`.
* `Küçükerenköy` in Kyrenia district apparently has no Greek name.

In [279]:
# Define manual corrections for Greek toponyms
greek_name_manual_mapping = {
    # Town : Greek name
    'Esentepe' : 'Άγιος Αμβρόσιος',
    'Küçükerenköy' : np.nan
    }

# Apply manual corrections
for turkish_toponym, greek_toponym in greek_name_manual_mapping.items():
    census_df_trnc.loc[census_df_trnc['town'] == turkish_toponym, 'greek_name'] = greek_toponym

Additionally, Greek names whose first letter carries an accent are transcribed incorrectly: the accent is returned as a separate character in front of the vowel. This requires a manual correction.

In [280]:
# Replace letters to match the Greek alphabet
vowel_mapping = {
    "\'Α" : "Ά",
    "\'Ε" : "Έ",
    "\'Η" : "Ή",
    "\'Ι" : "Ί",
    "\'Ο" : "Ό",
    "\'Υ" : "Ύ",
    "\'Ω" : "Ώ"}

# Apply the mapping
census_df_trnc['greek_name'] = census_df_trnc['greek_name'].replace(vowel_mapping, regex=True)
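
As a quick illustration of what this replacement does, the sketch below applies a one-entry version of the mapping to a made-up value in which the accent precedes the capital vowel. The sample string is an assumption about what the raw API output looks like, based on the description above.

```python
import pandas as pd

# First value mimics a name returned with a detached accent
sample = pd.Series(["'Αγιος Νικόλαος", "Μόρφου"])

# regex=True makes the dictionary keys match inside each string
fixed = sample.replace({"'Α": "Ά"}, regex=True)
print(fixed.tolist())
```

Names without a detached accent pass through unchanged.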

In [281]:
census_df_trnc

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id,greek_name
0,Di̇pkarpaz,2016,1007,1009,1935,968,967,İskele,Yeni,Dipkarpaz,35.617682,34.408731,ChIJFdx3t2tn3xQRInHdNFtCOCg,Ριζοκάρπασο
1,Gemi̇konağı,2075,1318,757,1498,854,644,Güzelyurt,Lefke,Lefke,35.142762,32.81127,ChIJw1gDZOdY5xQRzgnyO-yf0b4,Καραβοστάσι
2,Güzelyurt,7251,3619,3632,7627,3885,3742,Güzelyurt,Güzelyurt,Güzelyurt,35.212317,32.977567,ChIJfz7P_1H53RQRmbROZQ0T7bg,Μόρφου
3,Lapta,5748,2959,2789,5658,3151,2507,Girne,Girne,Lapta,35.336408,33.163266,ChIJSfTuTocM3hQRNbYv5mkqRUM,Λάπηθος
4,Di̇kmen,3969,2131,1838,2605,1464,1141,Girne,Girne,Dikmen,35.266918,33.3277,ChIJf0s8-cAT3hQRY6tDs7XxiJg,Δίκωμο
5,Alsancak,5595,2948,2647,4638,2577,2061,Girne,Girne,Alsancak,35.34208,33.208232,ChIJfZOGfvIM3hQRkYt4GUYmTbo,Καραβάς
6,Gi̇rne,21319,11403,9916,18744,10594,8150,Girne,Girne,Girne,35.299194,33.236325,ChIJKS2WH74N3hQRjxHlC6st0tM,Κερύνεια
7,Tatlısu,1120,579,541,903,461,442,Gazimağusa,Geçitkale,Tatlısu,35.378508,33.762946,ChIJR2udT_RS3hQRsTyTpE9AY8M,Ακανθού
8,Gazi̇mağusa,37642,20301,17341,33001,17897,15104,Gazimağusa,Gazimağusa,Gazimağusa,35.114912,33.919245,ChIJPRZM2kLI3xQRh87TpcUD2yk,Αμμόχωστος
9,Deği̇rmenli̇k,3284,1778,1506,3128,1685,1443,Lefkoşa,Değirmenlik,Değimenlik,35.255365,33.471983,ChIJlaSb_gc_3hQR-DikEPtnxRA,Κυθρέα


## Perform sanity checks

* Check that there are no duplicates in the `gm_id` column aside from NaN values.

In [282]:
# Check for duplicates in gm_id
display(census_df_trnc[census_df_trnc['gm_id'].duplicated(keep='first')])
display(census_df_trnc[census_df_trnc['gm_id'].duplicated(keep='last')])

Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id,greek_name
55,Çayönü,660,341,319,652,349,303,Gazimağusa,Akdoğan,Beyarmudu,35.096363,33.793615,,
81,Aşağı Karaman,675,346,329,565,312,253,Girne,Girne,Girne,,,,
105,Esentepe,1754,920,834,1575,949,626,Girne,Girne,Esentepe,35.340331,33.581775,,Άγιος Αμβρόσιος


Unnamed: 0,town,population,male_population,female_population,population_2006,male_population_2006,female_population_2006,district,subdistrict,municipality,lat,lon,gm_id,greek_name
35,Taşocakları,0,0,0,128,91,37,Lefkoşa,Değirmenlik,Değimenlik,,,,
55,Çayönü,660,341,319,652,349,303,Gazimağusa,Akdoğan,Beyarmudu,35.096363,33.793615,,
81,Aşağı Karaman,675,346,329,565,312,253,Girne,Girne,Girne,,,,
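
`duplicated` treats repeated NaN values as duplicates, which is why only rows without a `gm_id` appear above. Below is a sketch of a check that ignores missing IDs, using a toy frame with hypothetical IDs.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "town": ["A", "B", "C", "D", "E"],
    "gm_id": ["id1", np.nan, "id2", np.nan, "id1"],
})

# Keep only rows whose gm_id is present and occurs more than once
has_id = toy["gm_id"].notna()
real_duplicates = toy[has_id & toy["gm_id"].duplicated(keep=False)]
print(real_duplicates["town"].tolist())
```

With `keep=False` every member of a duplicated group is listed, so a single call replaces the `keep='first'` and `keep='last'` pair.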


## Save checkpoint

In [295]:
# Set version
version = 5

In [284]:
# Save a checkpoint of the dataframe to a csv file
census_df_trnc.to_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv', index=False)

In [307]:
# Load the checkpoint
census_df_trnc = pd.read_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv')

# 8) Recover the correct coordinates for non-ROC-controlled villages from OpenStreetMap

The dotted "i̇" characters in the TRNC census are not the same Unicode code points as a plain "i": each one is a regular "i" followed by a combining dot above (U+0307). Although the two forms look almost identical, the strings differ, so the names do not match those stored in OpenStreetMap and the API returns nothing. To solve this, each "i̇" is replaced by a regular "i" before the data is retrieved.

In [308]:
def normalize_turkish_toponyms(toponym):
    """Replaces specific Turkish characters with their
    Latin counterparts. This ensures that the character
    encoding is consistent with non-Turkish sources."""

    # Replace "i̇" (i + ·) with "i"
    toponym = toponym.replace('i̇', 'i')

    return toponym
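
To make the difference concrete, the sketch below inspects the code points involved. It assumes the census names were lowercased from capital `İ` at some earlier step, which is one way Python produces the two-code-point form. The sample name is one of the towns listed in this notebook.

```python
import unicodedata

# Capital dotted I (U+0130) lowercases to two code points
dotted = "\u0130".lower()
print([unicodedata.name(ch) for ch in dotted])

# The census spelling has one more code point than the plain spelling
name = "Di\u0307kmen"
print(len(name), len("Dikmen"))
print(name.replace("i\u0307", "i") == "Dikmen")
```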

In [309]:
def extract_exact_coordinates_osm(
        town_name: str,
        bbox: str = '(34.9,32.5,35.75,34.7)') -> tuple[float, float]:
    """Given a town name, extract the exact coordinates of the town from OpenStreetMap.

    Parameters
    ----------
    town_name : str
        The name of the town as it appears in the census.
    bbox : str, optional
        The bounding box to search in. Defaults to '(34.9,32.5,35.75,34.7)', 
        which corresponds to the coordinates of the occupied part of Cyprus.
        The format is (minlat, minlon, maxlat, maxlon).
    
    Returns
    -------
    lat, lon : tuple[float, float]
        The latitude and longitude of the town.
    """

    # Normalize the town name
    town_name = normalize_turkish_toponyms(town_name)

    # Set URL and query
    url = "https://overpass-api.de/api/interpreter"
    query = f"""
    [out:json];
    (
    node["name"="{town_name}"]["place"]{bbox};
    );
    out;
    """

    # way["name"="{town_name}"]["place"]{bbox};
    # relation["name"="{town_name}"]["place"]{bbox};
    
    # Send the query to the Overpass API
    response = requests.get(url, params={"data": query})

    # Parse the JSON response
    data = json.loads(response.text)

    # Check that there is only one result
    if len(data['elements']) != 1:
        print(f'Error: {town_name} has {len(data["elements"])} results.')
        return np.nan, np.nan
    else:
        print(f'Results for {town_name} saved successfully.')
    
    # Extract the coordinates
    lat = data['elements'][0]['lat']
    lon = data['elements'][0]['lon']

    return lat, lon    
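
The query above only matches nodes. The commented-out `way` and `relation` lines would return elements without top-level `lat` and `lon` keys. Below is a hedged sketch of one way to cover them: Overpass QL's `nwr` shorthand together with `out center;`, which adds a `center` object to ways and relations. `build_query` and `element_coordinates` are hypothetical helpers, and the sample elements are invented.

```python
def build_query(town_name, bbox="(34.9,32.5,35.75,34.7)"):
    """Build an Overpass QL query matching nodes, ways and relations."""
    return f'[out:json];nwr["name"="{town_name}"]["place"]{bbox};out center;'

def element_coordinates(element):
    """Return (lat, lon) from a node, or from the center of a way/relation."""
    if "lat" in element and "lon" in element:
        return element["lat"], element["lon"]
    center = element.get("center")
    if center:
        return center["lat"], center["lon"]
    return float("nan"), float("nan")

# Invented elements shaped like Overpass JSON output
node = {"type": "node", "lat": 35.34, "lon": 33.17}
way = {"type": "way", "center": {"lat": 35.27, "lon": 33.33}}
print(element_coordinates(node), element_coordinates(way))
```

Widening the query may return more than one element per name, so the single-result check in the function above would still be needed.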

In [310]:
# Create an empty column for the coordinates to fill
census_df_trnc['osm_lat'], census_df_trnc['osm_lon'] = np.nan, np.nan

# Generate coordinates if requested
generate_coordinates = False

if generate_coordinates:
    # Apply the function to the dataframe
    census_df_trnc['osm_lat'], census_df_trnc['osm_lon'] = zip(*census_df_trnc.apply(lambda x: extract_exact_coordinates_osm(x['town']), axis=1))

    # Extract a table with the town names and the OSM coordinates
    osm_coordinates_df = census_df_trnc[['town', 'osm_lat', 'osm_lon']]
    osm_coordinates_df.to_csv(f'sources/OpenStreetMap/osm_coordinates__{date.today().strftime("%Y-%m-%d")}.csv', index=False)

else:
    # Select the latest version of the OSM coordinates table
    list_of_files = glob.glob('sources/OpenStreetMap/osm_coordinates_*.csv')
    latest_file = max(list_of_files, key=os.path.getctime)

    # Load coordinates from file
    osm_coordinates_df = pd.read_csv(latest_file)

    # Add the OSM coordinates
    census_df_trnc = census_df_trnc.merge(osm_coordinates_df, on=['town'], how='left')

    # Remove 'osm_lat_x' and 'osm_lon_x' columns generated by the merge
    census_df_trnc = census_df_trnc.drop(columns=['osm_lat_x', 'osm_lon_x'])

    # Rename 'osm_lat_y' and 'osm_lon_y' columns
    census_df_trnc = census_df_trnc.rename(columns={'osm_lat_y': 'osm_lat', 'osm_lon_y': 'osm_lon'})
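
The `_x`/`_y` clean-up is needed because the placeholder `osm_lat` and `osm_lon` columns already exist when the merge runs. Below is a sketch of an equivalent approach that drops the placeholders first, shown on toy data with made-up coordinates.

```python
import numpy as np
import pandas as pd

census = pd.DataFrame({"town": ["Lapta", "Alsancak"], "osm_lat": np.nan, "osm_lon": np.nan})
cached = pd.DataFrame({"town": ["Lapta"], "osm_lat": [35.34], "osm_lon": [33.18]})

# Dropping the empty placeholders avoids the suffixed duplicate columns
merged = census.drop(columns=["osm_lat", "osm_lon"]).merge(cached, on="town", how="left")
print(merged.columns.tolist())
```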

In [313]:
# Inspect results
census_df_trnc[['town', 'osm_lat', 'osm_lon']]

Unnamed: 0,town,osm_lat,osm_lon
0,Di̇pkarpaz,35.59774,34.380671
1,Gemi̇konağı,35.138837,32.834612
2,Güzelyurt,35.198552,32.993499
3,Lapta,35.340912,33.175342
4,Di̇kmen,,
5,Alsancak,35.343171,33.195612
6,Gi̇rne,,
7,Tatlısu,35.373241,33.753563
8,Gazi̇mağusa,35.124544,33.932542
9,Deği̇rmenli̇k,35.247535,33.481284


In [314]:
# Show NaNs
census_df_trnc[census_df_trnc['osm_lat'].isna()][['town', 'osm_lat']]

Unnamed: 0,town,osm_lat
4,Di̇kmen,
6,Gi̇rne,
11,Lefkoşa,
35,Taşocakları,
41,Yeni̇ Boğazi̇çi̇,
64,Pi̇le,
71,Nergi̇sli̇,
80,Küçükerenköy,
81,Aşağı Karaman,
88,Zeyti̇nli̇k Köy,


## Add missing coordinates

Certain entities have no coordinates from OpenStreetMap, for several reasons:
- Manual corrections of the town names in previous steps.
- Merged towns (`Dikmen`).
- Other reasons.

Coordinates will be added manually to ensure the integrity of the dataset.

In [315]:
manual_osm_coordinates = {
    # Town : (lat, lon)
    'Di̇kmen' : (35.2677, 33.3252), # Using coordinates for Aşagi Dikmen (Kato Díkomo)
    'Gi̇rne' : (35.3351,33.3193),
    'Lefkoşa' : (35.1783, 33.3628),
    'Yeni̇ Boğazi̇çi̇' : (35.1858,33.8926), # Using an approximate location; there is no entity for this settlement
    'Pi̇le' : (35.0136, 33.6919), # Settlement included in the ROC census
    'Nergi̇sli̇' : (35.2179,33.6982), # Appears as "Nergizli"
    'Zeyti̇nli̇k Köy': (35.3330, 33.2915),
    'Yukarı Taşkent': (35.2785, 33.3770),
    'Kılıçarslan' : (35.2687,33.1259), # Listed as 'Kıliçarslan' in OSM
    'Tepebaşı' : (35.3060, 33.0540),
    'Kalkanlı' : (35.2465, 33.0379),
    'Deni̇zli̇' : (35.1474, 32.8499), # Adding coordinates from Google Maps; there is no entity for this settlement in OSM.
    'Sazlıköy' : (35.3991, 34.0257)
    }

# Apply manual corrections
for turkish_toponym, (lat, lon) in manual_osm_coordinates.items():
    census_df_trnc.loc[census_df_trnc['town'] == turkish_toponym, 'osm_lat'] = lat
    census_df_trnc.loc[census_df_trnc['town'] == turkish_toponym, 'osm_lon'] = lon
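
Because the assignment uses exact string matching, a key that differs from the census spelling by a single character silently does nothing. Below is a sketch of a guard that lists such keys before the corrections are applied. The toy frame and the misspelled key are illustrative only.

```python
import pandas as pd

towns = pd.DataFrame({"town": ["Lefkoşa", "Tepebaşı", "Kalkanlı"]})
manual = {"Lefkoşa": (35.1783, 33.3628), "Tepebasi": (35.3060, 33.0540)}

# Keys with no matching row would be ignored by the .loc assignment
known = set(towns["town"])
unmatched = sorted(key for key in manual if key not in known)
print(unmatched)
```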

It was not possible to retrieve coordinates for the following settlements, probably because they are special cases:
* `Küçükerenköy`
* `Taşocakları`
* `Aşağı Karaman`

In [316]:
# Inspect results
census_df_trnc[['town', 'osm_lat', 'osm_lon']]

Unnamed: 0,town,osm_lat,osm_lon
0,Di̇pkarpaz,35.59774,34.380671
1,Gemi̇konağı,35.138837,32.834612
2,Güzelyurt,35.198552,32.993499
3,Lapta,35.340912,33.175342
4,Di̇kmen,35.2677,33.3252
5,Alsancak,35.343171,33.195612
6,Gi̇rne,35.3351,33.3193
7,Tatlısu,35.373241,33.753563
8,Gazi̇mağusa,35.124544,33.932542
9,Deği̇rmenli̇k,35.247535,33.481284


## Set OSM coordinates as the main coordinates

In [317]:
# Keep the old coordinates in a separate column
census_df_trnc['lat_g'], census_df_trnc['lon_g'] = census_df_trnc['lat'], census_df_trnc['lon']

# Replace the old coordinates with the OSM ones
census_df_trnc['lat'], census_df_trnc['lon'] = census_df_trnc['osm_lat'], census_df_trnc['osm_lon']

# Drop OSM coordinates
census_df_trnc.drop(columns=['osm_lat', 'osm_lon'], inplace=True)
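
Since the Google coordinates are kept in `lat_g` and `lon_g`, the two sources can be compared. Below is a sketch using the haversine formula with a mean Earth radius of 6371 km. `haversine_km` is a hypothetical helper, and the sample pair is the Google and OSM position of one settlement as listed in this notebook's tables.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Google vs. OSM position for Sipahi (Agia Trias)
distance = haversine_km(35.533153, 34.236602, 35.542606, 34.224571)
print(round(distance, 2))
```

Applied row by row, a helper like this could flag settlements where the two sources disagree by more than a chosen threshold.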

## Save checkpoint

In [318]:
# Set version
version = 6

In [319]:
# Save a checkpoint of the dataframe to a csv file
census_df_trnc.to_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv', index=False)

In [320]:
# Load the checkpoint
census_df_trnc = pd.read_csv(f'checkpoints/secondary_checkpoints/TRNC_census_cp_v{str(version)}.csv')

# 9) Merge ROC and TRNC census dataframes

TODO:
* Add a column for control by the ROC
* Remove repeated gm_ids

## Concatenate the two dataframes

In [379]:
# Concatenate both dataframes
cyprus_df = pd.concat([census_df, census_df_trnc], ignore_index=True)

cyprus_df

Unnamed: 0,geo_code,town,population,male_population,female_population,district,is_dimos,lat,lon,gm_id,english_name,turkish_name,population_2006,male_population_2006,female_population_2006,subdistrict,municipality,greek_name,lat_g,lon_g
0,1000.0,Λευκωσία,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,Nicosia,Lefkoşa,,,,,,,,
1,1010.0,Άγιος Δομέτιος,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,Agios Dometios,Aydemet,,,,,,,,
2,1011.0,Έγκωμη,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,Egkomi,İncirli,,,,,,,,
3,1012.0,Στροβόλος,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,Strovolos,Strovolos,,,,,,,,
4,1013.0,Αγλαντζιά,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,Aglantzia,Atalassa,,,,,,,,
5,1021.0,Λακατάμεια,38345,18674,19671,Λευκωσίας,True,35.118566,33.314538,ChIJvfpe0pgb3hQRzVO7VGoRPls,Lakatamia,Lakadamya,,,,,,,,
6,1022.0,Συνοικισμός Ανθούπολης,1756,767,989,Λευκωσίας,False,35.113951,33.289679,ChIJB3reYWMb3hQRdCFNyi8yBbM,Anthoupolis,Anthoupolis,,,,,,,,
7,1023.0,Λατσιά,16774,8132,8642,Λευκωσίας,True,35.106364,33.378267,ChIJbbBNlTcZ3hQRO3tmqqvDS3E,Latsia,Uluyurt,,,,,,,,
8,1024.0,Γέρι,8235,4065,4170,Λευκωσίας,False,35.10757,33.422429,ChIJea8LrGAY3hQRBtiaqfqbnPM,Geri,Yeri,,,,,,,,
9,1100.0,Σιά,754,368,386,Λευκωσίας,False,34.955289,33.389474,ChIJyaqxJB2h4BQR4xR_8HkompQ,Sia,Sha,,,,,,,,


## Modify the concatenated dataframe

In [380]:
# Add a column indicating whether the settlement is under ROC control
# The 2006 figures exist only in the TRNC census, so a missing `population_2006` value means the settlement is under ROC control
cyprus_df['roc_control'] = cyprus_df['population_2006'].isna()

# Inspect results
display(cyprus_df[['town', 'roc_control']].head())
display(cyprus_df[['town', 'roc_control']].tail())

Unnamed: 0,town,roc_control
0,Λευκωσία,True
1,Άγιος Δομέτιος,True
2,Έγκωμη,True
3,Στροβόλος,True
4,Αγλαντζιά,True


Unnamed: 0,town,roc_control
589,Si̇pahi̇,False
590,Taşlıca,False
591,Yeşi̇lköy,False
592,Zi̇yamet,False
593,Erenköy,False


In [381]:
# Drop the 'geo_code' column
cyprus_df.drop(columns=['geo_code'], inplace=True)

In [382]:
# Fill NaNs in is_dimos with False
cyprus_df['is_dimos'] = cyprus_df['is_dimos'].fillna(False)

In [383]:
# Create a column with the raw information about districts. The `district` column will be modified later
cyprus_df['district_raw'] = cyprus_df['district']

## Reorder columns

In [384]:
# Distinguish common and non-common columns
common_columns = [
    'town', 'greek_name', 'turkish_name', 
    'population', 'male_population', 'female_population', 
    'district', 'roc_control',
    'lat', 'lon', 'gm_id']

non_common_columns = [
    'is_dimos',
    'population_2006', 'male_population_2006', 'female_population_2006',
    'district_raw', 'subdistrict', 'municipality', 
    'lat_g', 'lon_g']

# Set order
order = common_columns + non_common_columns

# Reorder columns
cyprus_df = cyprus_df[order]
cyprus_df.head()

Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Λευκωσία,,Lefkoşa,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Άγιος Δομέτιος,,Aydemet,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Έγκωμη,,İncirli,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Στροβόλος,,Strovolos,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Αγλαντζιά,,Atalassa,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,


## Map merged results

In [385]:
map = folium.Map(location=[35.1264, 33.4299], zoom_start=9)

for i in range(len(cyprus_df)):
    if not np.isnan(cyprus_df['lat'][i]):
        folium.CircleMarker(
            location=[cyprus_df['lat'][i], cyprus_df['lon'][i]],
            radius=5,
            color= 'blue' if cyprus_df['roc_control'][i] else 'red',
            fill=True,
            fill_color= 'blue' if cyprus_df['roc_control'][i] else 'red',
            fill_opacity=0.7,
            # Add text with the town name when hovering over the marker
            tooltip=f"{cyprus_df['town'][i]}",
            parse_html=False).add_to(map)
    
map

In [386]:
# Show all rows
cyprus_df

Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Λευκωσία,,Lefkoşa,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Άγιος Δομέτιος,,Aydemet,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Έγκωμη,,İncirli,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Στροβόλος,,Strovolos,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Αγλαντζιά,,Atalassa,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,
5,Λακατάμεια,,Lakadamya,38345,18674,19671,Λευκωσίας,True,35.118566,33.314538,ChIJvfpe0pgb3hQRzVO7VGoRPls,True,,,,Λευκωσίας,,,,
6,Συνοικισμός Ανθούπολης,,Anthoupolis,1756,767,989,Λευκωσίας,True,35.113951,33.289679,ChIJB3reYWMb3hQRdCFNyi8yBbM,False,,,,Λευκωσίας,,,,
7,Λατσιά,,Uluyurt,16774,8132,8642,Λευκωσίας,True,35.106364,33.378267,ChIJbbBNlTcZ3hQRO3tmqqvDS3E,True,,,,Λευκωσίας,,,,
8,Γέρι,,Yeri,8235,4065,4170,Λευκωσίας,True,35.10757,33.422429,ChIJea8LrGAY3hQRBtiaqfqbnPM,False,,,,Λευκωσίας,,,,
9,Σιά,,Sha,754,368,386,Λευκωσίας,True,34.955289,33.389474,ChIJyaqxJB2h4BQR4xR_8HkompQ,False,,,,Λευκωσίας,,,,


## Set standard toponyms

The Latinized version of the Greek toponyms will be used as the standard name for each settlement. The Greek toponyms will be used as the alternative name.

In [387]:
# Fill the NaNs in the `greek_name` column with the `town` column for the ROC settlements
roc_towns = cyprus_df.loc[cyprus_df['roc_control'] == True].copy()
roc_towns.loc[roc_towns['greek_name'].isna(), 'greek_name'] = roc_towns.loc[roc_towns['greek_name'].isna(), 'town']

# Fill the NaNs in the `turkish_name` column with the `town` column for the TRNC settlements
trnc_towns = cyprus_df.loc[cyprus_df['roc_control'] == False].copy()
trnc_towns.loc[trnc_towns['turkish_name'].isna(), 'turkish_name'] = trnc_towns.loc[trnc_towns['turkish_name'].isna(), 'town']

# Combine the ROC and TRNC DataFrames back into `cyprus_df`
cyprus_df = pd.concat([roc_towns, trnc_towns])

# Show results
display(cyprus_df.head())
display(cyprus_df.tail())

Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Λευκωσία,Λευκωσία,Lefkoşa,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Άγιος Δομέτιος,Άγιος Δομέτιος,Aydemet,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Έγκωμη,Έγκωμη,İncirli,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Στροβόλος,Στροβόλος,Strovolos,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Αγλαντζιά,Αγλαντζιά,Atalassa,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,


Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
589,Si̇pahi̇,Αγία Τριάς,Si̇pahi̇,614,295,319,İskele,False,35.542606,34.224571,ChIJRe5YJrdt3xQRNMiDadjI280,False,659.0,322.0,337.0,İskele,Yeni,Yeni,35.533153,34.236602
590,Taşlıca,Νέτα,Taşlıca,80,40,40,İskele,False,35.469783,34.214029,ChIJKVz5NrZz3xQRC0ySgw6qt50,False,99.0,53.0,46.0,İskele,Yeni,Yeni,35.464387,34.217147
591,Yeşi̇lköy,Άγιος Ανδρόνικος,Yeşi̇lköy,799,411,388,İskele,False,35.505035,34.162471,ChIJ_bg3-89y3xQRWY8WMwdxxjA,False,777.0,386.0,391.0,İskele,Yeni,Yeni,35.503141,34.16432
592,Zi̇yamet,Λεωνάρισο,Zi̇yamet,739,387,352,İskele,False,35.468513,34.138289,ChIJv1L4zExz3xQRlKeaoS9biFo,False,715.0,365.0,350.0,İskele,Yeni,Yeni,35.46234,34.125375
593,Erenköy,Κόκκινα,Erenköy,0,0,0,Lefke,False,35.179002,32.609712,ChIJ_Sn_GWVh5xQRezW4HeaNPhE,False,0.0,0.0,0.0,Lefke,Lefke,Lefke,35.179457,32.610868
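
The split-and-concatenate step can also be written as two masked assignments on the original frame, which keeps the row order without a second `concat`. Below is a sketch on toy rows with the same column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "town": ["Λευκωσία", "Lapta"],
    "greek_name": [np.nan, "Λάπηθος"],
    "turkish_name": ["Lefkoşa", np.nan],
    "roc_control": [True, False],
})

# ROC rows: a missing Greek name falls back to the census name
roc_missing = df["roc_control"] & df["greek_name"].isna()
df.loc[roc_missing, "greek_name"] = df.loc[roc_missing, "town"]

# Non-ROC rows: a missing Turkish name falls back to the census name
trnc_missing = ~df["roc_control"] & df["turkish_name"].isna()
df.loc[trnc_missing, "turkish_name"] = df.loc[trnc_missing, "town"]

print(df[["greek_name", "turkish_name"]].values.tolist())
```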


In [388]:
# Function to transcribe Greek characters to Latin characters
def greek_to_latin(text):
    """
    Transcribes Greek characters to Latin characters.

    Parameters
    ----------
    text : str
        The text to be transcribed.

    Returns
    -------
    str
        The transcribed text.
    """

    # If the value is NaN, return it unchanged
    if pd.isna(text):
        return text
    
    # Otherwise, transcribe
    else:
        greek_to_latin_dict = {
            'Ά': 'A',
            'Έ': 'E',
            'Ή': 'I',
            'Ί': 'I',
            'Ό': 'O',
            'Ύ': 'Y',
            'Ώ': 'O',
            'Ϊ': 'I',
            'Ϋ': 'Y',
            'ά': 'a',
            'έ': 'e',
            'ή': 'i',
            'ί': 'i',
            'ό': 'o',
            'ύ': 'y',
            'ώ': 'o',
            'ϊ': 'i',
            'ϋ': 'y',
            'ΐ': 'i',
            'ΰ': 'y',
            'Α': 'A',
            'Β': 'V',
            'Γ': 'G',
            'Δ': 'D',
            'Ε': 'E',
            'Ζ': 'Z',
            'Η': 'I',
            'Θ': 'Th',
            'Ι': 'I',
            'Κ': 'K',
            'Λ': 'L',
            'Μ': 'M',
            'Ν': 'N',
            'Ξ': 'X',
            'Ο': 'O',
            'Π': 'P',
            'Ρ': 'R',
            'Σ': 'S',
            'Τ': 'T',
            'Υ': 'Y',
            'Φ': 'F',
            'Χ': 'Ch',
            'Ψ': 'Ps',
            'Ω': 'O',
            'α': 'a',
            'β': 'v',
            'γ': 'g',
            'δ': 'd',
            'ε': 'e',
            'ζ': 'z',
            'η': 'i',
            'θ': 'th',
            'ι': 'i',
            'κ': 'k',
            'λ': 'l',
            'μ': 'm',
            'ν': 'n',
            'ξ': 'x',
            'ο': 'o',
            'π': 'p',
            'ρ': 'r',
            'σ': 's',
            'τ': 't',
            'υ': 'y',
            'φ': 'f',
            'χ': 'ch',
            'ψ': 'ps',
            'ω': 'o',
            'ς': 's'}
        
        # Replace first 'ου' by 'ou' and then 'ού' by 'ou'
        text = text.replace('ου', 'ou').replace('ού', 'ou')

        # Replace all other Greek characters by their Latin equivalent
        for letter in text:
            if letter in greek_to_latin_dict:
                text = text.replace(letter, greek_to_latin_dict[letter])

        # Change 'y' to 'v' if it is after an 'a' or an 'e'
        text = re.sub(r'([ae])y', r'\1v', text)

        # Change 'v' to 'f' if it is before a 't'
        text = re.sub(r'v([t])', r'f\1', text)
        
        return text
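
The last two rules in the function operate on the already Latinized text: `y` after `a` or `e` becomes `v`, and that `v` becomes `f` only before `t`. This is why Λευκωσία comes out as `Levkosia` and needs special handling. Below is a sketch of a broader second rule, assuming `v` should become `f` before any voiceless consonant (`t`, `k`, `p`, `s`, `f`, `x`, and the digraphs `th`, `ch` and `ps`). This is an illustrative variant, not the notebook's rule, and it would also affect a `v` that came from `β` in the same position.

```python
import re

def apply_digraph_rules(latin_text):
    """Post-process Latinized text for the αυ/ευ digraphs."""
    # 'ay'/'ey' -> 'av'/'ev'
    text = re.sub(r"([ae])y", r"\1v", latin_text)
    # 'v' after 'a'/'e' -> 'f' before a voiceless consonant (assumed generalization)
    return re.sub(r"(?<=[ae])v(?=[tkpsfxc])", "f", text)

print(apply_digraph_rules("Leykosia"))   # intermediate form of Λευκωσία
print(apply_digraph_rules("Strovolos"))  # 'v' before a vowel stays as is
```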


In [389]:
# Apply the function to the `greek_name` and convert it to the town name
cyprus_df['town'] = cyprus_df['greek_name'].apply(greek_to_latin)

# As a special case, change the town name of 'Λευκωσία' to 'Lefkosia'
cyprus_df.loc[cyprus_df['greek_name'] == 'Λευκωσία', 'town'] = 'Lefkosia'

# Show results
display(cyprus_df.head())
display(cyprus_df.tail())

Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Lefkosia,Λευκωσία,Lefkoşa,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Agios Dometios,Άγιος Δομέτιος,Aydemet,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Egkomi,Έγκωμη,İncirli,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Strovolos,Στροβόλος,Strovolos,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Aglantzia,Αγλαντζιά,Atalassa,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,


Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
589,Agia Trias,Αγία Τριάς,Si̇pahi̇,614,295,319,İskele,False,35.542606,34.224571,ChIJRe5YJrdt3xQRNMiDadjI280,False,659.0,322.0,337.0,İskele,Yeni,Yeni,35.533153,34.236602
590,Neta,Νέτα,Taşlıca,80,40,40,İskele,False,35.469783,34.214029,ChIJKVz5NrZz3xQRC0ySgw6qt50,False,99.0,53.0,46.0,İskele,Yeni,Yeni,35.464387,34.217147
591,Agios Andronikos,Άγιος Ανδρόνικος,Yeşi̇lköy,799,411,388,İskele,False,35.505035,34.162471,ChIJ_bg3-89y3xQRWY8WMwdxxjA,False,777.0,386.0,391.0,İskele,Yeni,Yeni,35.503141,34.16432
592,Leonariso,Λεωνάρισο,Zi̇yamet,739,387,352,İskele,False,35.468513,34.138289,ChIJv1L4zExz3xQRlKeaoS9biFo,False,715.0,365.0,350.0,İskele,Yeni,Yeni,35.46234,34.125375
593,Kokkina,Κόκκινα,Erenköy,0,0,0,Lefke,False,35.179002,32.609712,ChIJ_Sn_GWVh5xQRezW4HeaNPhE,False,0.0,0.0,0.0,Lefke,Lefke,Lefke,35.179457,32.610868


For the special cases where there is no Greek toponym, the Turkish toponym is used. To make these names easy to type with standard Latin characters, the following characters are replaced:
- `ı` with `i`
- `i̇` (`i` plus a combining dot) with `i`
- `ğ` with `g`
- `ş` with `sh`
- `İ` with `I`
- `I` with `I`

In [398]:
def latinize_turkish_toponyms(text):
    """
    Latinizes Turkish toponyms. This makes it easier to search for them in the
    database with a standard Latin alphabet keyboard.

    Parameters
    ----------
    text : str
        The text to be latinized.

    Returns
    -------
    str
        The latinized text.
    """
    # Map Turkish-specific characters to their Latin counterparts
    replacements = {
        'ı': 'i', 
        'i̇': 'i', 
        'ğ': 'g', 
        'ş': 'sh', 
        'İ': 'I',
        'I' : 'I'}
    
    for k, v in replacements.items():
        text = text.replace(k, v)
    return text
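
Below is a sketch of the same replacements expressed as a translation table, which also removes the combining dot (U+0307) left over from the census spelling. `TURKISH_TO_LATIN` and `latinize` are hypothetical names. Characters not listed above, such as `ç`, `ö` and `ü`, are left untouched here as well.

```python
# Keys are written as escapes so each one is a single code point
TURKISH_TO_LATIN = str.maketrans({
    "\u0131": "i",    # dotless i
    "\u011f": "g",    # g with breve
    "\u015f": "sh",   # s with cedilla
    "\u0130": "I",    # capital dotted I
    "\u0307": None,   # combining dot above: delete
})

def latinize(text):
    """Latinize a Turkish toponym using the table above."""
    return text.translate(TURKISH_TO_LATIN)

for name in ("Ta\u015focaklar\u0131", "A\u015fa\u011f\u0131 Karaman", "Di\u0307kmen"):
    print(latinize(name))
```

A translation table makes a single pass over the string, and an entry mapped to `None` deletes that character.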

In [391]:
# Fill the NaNs in the `town` column with the `turkish_name` column for the TRNC settlements
cyprus_df.loc[cyprus_df['town'].isna(), 'town'] = cyprus_df.loc[cyprus_df['town'].isna(), 'turkish_name'].apply(latinize_turkish_toponyms)

# Show results
display(cyprus_df.head())
display(cyprus_df.tail())

Taşocakları
Çayönü
Aşağı Karaman


Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Lefkosia,Λευκωσία,Lefkoşa,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Agios Dometios,Άγιος Δομέτιος,Aydemet,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Egkomi,Έγκωμη,İncirli,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Strovolos,Στροβόλος,Strovolos,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Aglantzia,Αγλαντζιά,Atalassa,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,


Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
589,Agia Trias,Αγία Τριάς,Si̇pahi̇,614,295,319,İskele,False,35.542606,34.224571,ChIJRe5YJrdt3xQRNMiDadjI280,False,659.0,322.0,337.0,İskele,Yeni,Yeni,35.533153,34.236602
590,Neta,Νέτα,Taşlıca,80,40,40,İskele,False,35.469783,34.214029,ChIJKVz5NrZz3xQRC0ySgw6qt50,False,99.0,53.0,46.0,İskele,Yeni,Yeni,35.464387,34.217147
591,Agios Andronikos,Άγιος Ανδρόνικος,Yeşi̇lköy,799,411,388,İskele,False,35.505035,34.162471,ChIJ_bg3-89y3xQRWY8WMwdxxjA,False,777.0,386.0,391.0,İskele,Yeni,Yeni,35.503141,34.16432
592,Leonariso,Λεωνάρισο,Zi̇yamet,739,387,352,İskele,False,35.468513,34.138289,ChIJv1L4zExz3xQRlKeaoS9biFo,False,715.0,365.0,350.0,İskele,Yeni,Yeni,35.46234,34.125375
593,Kokkina,Κόκκινα,Erenköy,0,0,0,Lefke,False,35.179002,32.609712,ChIJ_Sn_GWVh5xQRezW4HeaNPhE,False,0.0,0.0,0.0,Lefke,Lefke,Lefke,35.179457,32.610868


In [392]:
cyprus_df

Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Lefkosia,Λευκωσία,Lefkoşa,55014,26520,28494,Λευκωσίας,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Agios Dometios,Άγιος Δομέτιος,Aydemet,12456,5861,6595,Λευκωσίας,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Egkomi,Έγκωμη,İncirli,18010,8547,9463,Λευκωσίας,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Strovolos,Στροβόλος,Strovolos,67904,32248,35656,Λευκωσίας,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Aglantzia,Αγλαντζιά,Atalassa,20783,9803,10980,Λευκωσίας,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,
5,Lakatameia,Λακατάμεια,Lakadamya,38345,18674,19671,Λευκωσίας,True,35.118566,33.314538,ChIJvfpe0pgb3hQRzVO7VGoRPls,True,,,,Λευκωσίας,,,,
6,Synoikismos Anthoupolis,Συνοικισμός Ανθούπολης,Anthoupolis,1756,767,989,Λευκωσίας,True,35.113951,33.289679,ChIJB3reYWMb3hQRdCFNyi8yBbM,False,,,,Λευκωσίας,,,,
7,Latsia,Λατσιά,Uluyurt,16774,8132,8642,Λευκωσίας,True,35.106364,33.378267,ChIJbbBNlTcZ3hQRO3tmqqvDS3E,True,,,,Λευκωσίας,,,,
8,Geri,Γέρι,Yeri,8235,4065,4170,Λευκωσίας,True,35.10757,33.422429,ChIJea8LrGAY3hQRBtiaqfqbnPM,False,,,,Λευκωσίας,,,,
9,Sia,Σιά,Sha,754,368,386,Λευκωσίας,True,34.955289,33.389474,ChIJyaqxJB2h4BQR4xR_8HkompQ,False,,,,Λευκωσίας,,,,


## Standardize districts

The same rules used for town names are applied to the `district` column. The ROC district names, which appear in the genitive case, are first changed to the nominative case.

In [395]:
# Set genitive - nominative pairs for ROC districts
genitive_nominative_pairs = {
    # Genitive : Nominative
    'Λευκωσίας' : 'Λεφκωσία', # Deliberately spelled with φ instead of υ so that the later conversion yields 'Lefkosia'
    'Αμμοχώστου' : 'Αμμόχωστος',
    'Λάρνακας' : 'Λάρνακα',
    'Λεμεσού' : 'Λεμεσός',
    'Πάφου' : 'Πάφος'}

# Apply mapping to the `district` column if the settlement is under ROC control
cyprus_df['district'] = cyprus_df.apply(lambda x: genitive_nominative_pairs[x['district']] if x['roc_control'] else x['district'], axis=1)
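
The row-wise `apply` above raises a `KeyError` if an ROC-controlled row carries a district that is missing from the dictionary. As a hedged sketch of a vectorized alternative (not the notebook's method), `Series.map` combined with `Series.where` leaves unmapped values unchanged instead. The names `toy_df` and `pairs` are hypothetical, and the toy data only mimics the structure of `cyprus_df`.

```python
import pandas as pd

# Toy data: two ROC rows with genitive district names, one non-ROC row
toy_df = pd.DataFrame({
    'district': ['Λευκωσίας', 'Πάφου', 'İskele'],
    'roc_control': [True, True, False]})

# Subset of the genitive -> nominative pairs, for illustration only
pairs = {'Λευκωσίας': 'Λεφκωσία', 'Πάφου': 'Πάφος'}

# Map every value; districts not in the dictionary become NaN
mapped = toy_df['district'].map(pairs)

# Use the mapped value only for ROC rows where a mapping exists,
# otherwise keep the original district name
use_mapped = toy_df['roc_control'] & mapped.notna()
toy_df['district'] = mapped.where(use_mapped, toy_df['district'])

print(toy_df['district'].tolist())  # → ['Λεφκωσία', 'Πάφος', 'İskele']
```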

In [399]:
# Latinize the `district` column for ROC and non-ROC districts
# Greek districts
cyprus_df['district'] = cyprus_df['district'].apply(greek_to_latin)

# Turkish districts
cyprus_df['district'] = cyprus_df['district'].apply(latinize_turkish_toponyms)

# Show results
display(cyprus_df.head())
display(cyprus_df.tail())

Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
0,Lefkosia,Λευκωσία,Lefkoşa,55014,26520,28494,Lefkosia,True,35.185566,33.382276,ChIJVU1JymcX3hQRpcARA5ykXls,True,,,,Λευκωσίας,,,,
1,Agios Dometios,Άγιος Δομέτιος,Aydemet,12456,5861,6595,Lefkosia,True,35.172787,33.329092,ChIJqzUpSaEQ3hQRgmUX_emhREA,True,,,,Λευκωσίας,,,,
2,Egkomi,Έγκωμη,İncirli,18010,8547,9463,Lefkosia,True,35.153823,33.316954,ChIJtcdnJb8Q3hQR80ccwpjkIPk,True,,,,Λευκωσίας,,,,
3,Strovolos,Στροβόλος,Strovolos,67904,32248,35656,Lefkosia,True,35.143663,33.343791,ChIJ90x2dika3hQRq7-H2HRHAJo,True,,,,Λευκωσίας,,,,
4,Aglantzia,Αγλαντζιά,Atalassa,20783,9803,10980,Lefkosia,True,35.149803,33.394086,ChIJb0vkuNMZ3hQR_4oSWBdFjX0,True,,,,Λευκωσίας,,,,


Unnamed: 0,town,greek_name,turkish_name,population,male_population,female_population,district,roc_control,lat,lon,gm_id,is_dimos,population_2006,male_population_2006,female_population_2006,district_raw,subdistrict,municipality,lat_g,lon_g
589,Agia Trias,Αγία Τριάς,Si̇pahi̇,614,295,319,Iskele,False,35.542606,34.224571,ChIJRe5YJrdt3xQRNMiDadjI280,False,659.0,322.0,337.0,İskele,Yeni,Yeni,35.533153,34.236602
590,Neta,Νέτα,Taşlıca,80,40,40,Iskele,False,35.469783,34.214029,ChIJKVz5NrZz3xQRC0ySgw6qt50,False,99.0,53.0,46.0,İskele,Yeni,Yeni,35.464387,34.217147
591,Agios Andronikos,Άγιος Ανδρόνικος,Yeşi̇lköy,799,411,388,Iskele,False,35.505035,34.162471,ChIJ_bg3-89y3xQRWY8WMwdxxjA,False,777.0,386.0,391.0,İskele,Yeni,Yeni,35.503141,34.16432
592,Leonariso,Λεωνάρισο,Zi̇yamet,739,387,352,Iskele,False,35.468513,34.138289,ChIJv1L4zExz3xQRlKeaoS9biFo,False,715.0,365.0,350.0,İskele,Yeni,Yeni,35.46234,34.125375
593,Kokkina,Κόκκινα,Erenköy,0,0,0,Lefke,False,35.179002,32.609712,ChIJ_Sn_GWVh5xQRezW4HeaNPhE,False,0.0,0.0,0.0,Lefke,Lefke,Lefke,35.179457,32.610868


## Save checkpoint

In [400]:
# Set version
version = 5

In [401]:
# Save a checkpoint of the dataframe to a csv file
cyprus_df.to_csv(f'checkpoints/CyprusDB_cp_v{version}.csv', index=False)

In [402]:
# Load the checkpoint
cyprus_df = pd.read_csv(f'checkpoints/CyprusDB_cp_v{version}.csv')
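
Saving a checkpoint fails if the `checkpoints` folder does not exist, and a CSV round trip can silently change values or dtypes. The sketch below shows one way to guard against both, assuming a toy dataframe and a temporary directory so that it can run anywhere. The names `toy_df`, `checkpoint_dir` and `reloaded` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for `cyprus_df`
toy_df = pd.DataFrame({
    'town': ['Lefkosia', 'Kokkina'],
    'population': [55014, 0],
    'roc_control': [True, False]})

version = 5

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the checkpoint folder if it is missing
    checkpoint_dir = os.path.join(tmp_dir, 'checkpoints')
    os.makedirs(checkpoint_dir, exist_ok=True)

    # Save and immediately reload the checkpoint
    path = os.path.join(checkpoint_dir, f'CyprusDB_cp_v{version}.csv')
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if the reloaded values differ from the originals
pd.testing.assert_frame_equal(toy_df, reloaded, check_dtype=False)
```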

# Save final dataset

In [403]:
# Set file name
filename = 'CyprusDB'

# Save results to a CSV file and an Excel file (`to_excel` needs an Excel writer engine such as openpyxl)
cyprus_df.to_csv(f'{filename}.csv', index=False)
cyprus_df.to_excel(f'{filename}.xlsx', index=False)