# Neighbouring Countries Nodes

The treatment in this notebook mirrors the one used for node generation; only the node selection step differs.

We develop the pipeline below on the Geonames data for France. The treatment for Italy, Germany and Austria is added in the actual project code.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Geonames Data

We apply the same treatment as before to extract and clean the data.

In [9]:
# Column names of the Geonames dump, in file order
columns = ['geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude', 'feature class',
           'feature code', 'country code', 'cc2', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code',
           'population', 'elevation', 'dem', 'timezone', 'modification date']

# Tab-delimited, UTF-8, no header row; read admin1 codes as strings
df = pd.read_csv('../data/FR/FR.txt', header=None, names=columns, encoding='utf8',
                 delimiter='\t', dtype={'admin1 code': str})

df.head()

| | geonameid | name | asciiname | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 code | admin2 code | admin3 code | admin4 code | population | elevation | dem | timezone | modification date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2659086 | Recon, Col de | Recon, Col de | Rapenaz Col de,Recon Col de | 46.30352 | 6.82838 | T | PASS | FR | CH | 84 | 74.0 | 744.0 | 74058.0 | 0 | | 1733 | Europe/Zurich | 2016-12-10 |
| 1 | 2659815 | Lucelle | Lucelle | La Lucelle Riviere,La Lucelle Rivière,Lucelle,... | 47.41667 | 7.5 | H | STM | FR | | 0 | | | | 0 | | 353 | Europe/Zurich | 2014-08-05 |
| 2 | 2659933 | Les Cornettes de Bise | Les Cornettes de Bise | Cornettes de Bise,Les Cornettes de Bise | 46.33333 | 6.78333 | T | PK | FR | CH | 84 | 74.0 | 744.0 | 74058.0 | 0 | 2432.0 | 2322 | Europe/Zurich | 2016-02-18 |
| 3 | 2659943 | Ruisseau le Lertzbach | Ruisseau le Lertzbach | Le Lertzbach Ruisseau,Lertzbach,Ruisseau le Le... | 47.58333 | 7.58333 | H | STM | FR | | 0 | | | | 0 | | 245 | Europe/Zurich | 2012-06-05 |
| 4 | 2659973 | Le Cheval Blanc | Le Cheval Blanc | Le Cheval Blanc | 46.05132 | 6.87178 | T | MT | FR | CH | 84 | 74.0 | 742.0 | 74290.0 | 0 | 2831.0 | 2807 | Europe/Zurich | 2016-12-10 |


In [10]:
# Drop rows with a population of 0
df = df[df['population'] != 0]

# Keep only cities: feature code PPL, PPLA (optionally followed by a digit, e.g. PPLA3) or PPLC
df = df[df['feature code'].str.contains(r'^PPL(?:A\d?|C)?$', na=False)]

# Keep the columns we need
df = df[['asciiname', 'latitude', 'longitude', 'admin1 code', 'feature code', 'population']]
print('Shape : {}'.format(df.shape))
df.head()

Shape : (33965, 6)




| | asciiname | latitude | longitude | admin1 code | feature code | population |
|---|---|---|---|---|---|---|
| 19 | Peyrat-le-Chateau | 45.81578 | 1.77233 | 75 | PPL | 1140 |
| 23 | Domecy-sur-le-Vault | 47.49084 | 3.80953 | 27 | PPL | 107 |
| 24 | Blaye | 45.13333 | -0.66667 | 75 | PPLA3 | 5277 |
| 25 | Zuytpeene | 50.79473 | 2.43027 | 32 | PPL | 483 |
| 26 | Zuydcoote | 51.06096 | 2.49338 | 32 | PPL | 1660 |
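
To make the feature-code filter concrete, the sketch below checks the same pattern against a handful of Geonames populated-place codes using the standard `re` module. The list of codes is only a small illustrative sample, and `CITY_CODE` is a name introduced here, not one from the project code.

```python
import re

# Same pattern as in the filter above, written with a non-capturing group.
# CITY_CODE is a hypothetical name used only in this sketch.
CITY_CODE = re.compile(r'PPL(?:A\d?|C)?$')

# A small sample of Geonames feature codes from the populated-place class
codes = ['PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLC', 'PPLX', 'PPLL', 'PPLQ']

# re.match anchors at the start of the string, '$' anchors at the end
kept = [code for code in codes if CITY_CODE.match(code)]
dropped = [code for code in codes if not CITY_CODE.match(code)]

print(kept)     # ['PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLC']
print(dropped)  # ['PPLX', 'PPLL', 'PPLQ']
```

Plain populated places, administrative seats of any level and the capital are kept, while other codes such as sections of populated places (`PPLX`) are discarded.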


In [22]:
df[df['asciiname'] == 'Pontarlier']

| | asciiname | latitude | longitude | admin1 code | feature code | population |
|---|---|---|---|---|---|---|
| 19177 | Pontarlier | 46.90347 | 6.35542 | 27 | PPLA3 | 20313 |


## Node Selection

### Main idea
The key point here is to define criteria for selecting the relevant nodes.

Unlike for Switzerland, we cannot simply keep the $n$ most populated cities: they are spread all over France, and most of them are probably unrelated to Switzerland, so they would not help to detect flows.
In addition to population, we should therefore take the distance to Switzerland into account.

> The idea is to take, for each of the four countries, the $n$ cities closest to Switzerland whose population exceeds $pop\_threshold$.

Both parameters can be tuned to get more or fewer nodes.

### Distance to Switzerland
We need to define what "close to Switzerland" means, since cities will not be selected the same way if we consider, for example, the distance to the center of the country or the distance to the nearest border.

Since working with borders directly seems unreasonably hard, and the "center" of Switzerland is too coarse a reference, a good approach is to use the nodes already generated for Swiss cities.

Our final goal is to *match* the foreign nodes we are building with the previously built Swiss nodes, so it makes sense to generate the nodes that are most likely to be matched by a flow. Those are simply the ones closest to the Swiss nodes.
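
As a rough illustration of this criterion, the sketch below computes the distance from a city to its closest node. It uses the haversine great-circle formula as a stand-in, since the distance actually implemented by `Node.dist` is not shown in this notebook. The Swiss coordinates are approximate city-centre values chosen for illustration, and the helper names `haversine_km` and `distance_to_closest` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def distance_to_closest(point, nodes):
    """Smallest distance from point to any of the reference nodes."""
    return min(haversine_km(point, node) for node in nodes)


# Assumption: approximate coordinates standing in for the generated Swiss nodes
swiss = {'Geneva': (46.2044, 6.1432), 'Basel': (47.5596, 7.5886), 'Zurich': (47.3769, 8.5417)}

annemasse = (46.19439, 6.23775)    # latitude and longitude from the Geonames data
saint_louis = (47.58836, 7.56247)

print(round(distance_to_closest(annemasse, swiss.values()), 1))
print(round(distance_to_closest(saint_louis, swiss.values()), 1))
```

With a single "center" of Switzerland as reference, both cities would look far away; measured against the nearest node, each one is only a few kilometres from Geneva or Basel respectively.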

**Note**: we need to be careful with the number of nodes we generate (Swiss and foreign), since the computational cost grows quickly with it.

### Code

First, we filter out cities by population.

In [29]:
# Keep cities with pop > pop_threshold
pop_threshold = 15000
df = df[df['population'] > pop_threshold]

Now we need to import the previously generated nodes and filter the cities by their distance to the closest Swiss node.

In [30]:
import sys
sys.path.append('../swiss_flows')
from node import Node

In [32]:
swiss_nodes = Node.generate_swiss_nodes(n_nodes=10)



For each foreign city, we create a feature $distance$, which is the distance to the closest Swiss node.

In [72]:
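def find_closest_swiss_node(lat, lon):
    """Return the distance from (lat, lon) to the closest Swiss node."""
    # Temporary node wrapping the city coordinates, so that Node.dist can be used
    tmp = Node('tmp', (lat, lon), 0, None)

    # Smallest distance over all Swiss nodes (infinity if there are none)
    return min((tmp.dist(node) for node in swiss_nodes), default=float('inf'))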

In [73]:
# Create the new distance feature
df['distance'] = df.apply(lambda x: find_closest_swiss_node(x['latitude'], x['longitude']), axis=1)
df.head()

| | asciiname | latitude | longitude | admin1 code | feature code | population | distance |
|---|---|---|---|---|---|---|---|
| 11634 | Saint-Louis | 47.58836 | 7.56247 | 44 | PPL | 20871 | 3.429582 |
| 70253 | Annemasse | 46.19439 | 6.23775 | 84 | PPL | 28275 | 7.138707 |
| 5641 | Thonon-les-Bains | 46.36667 | 6.48333 | 84 | PPLA3 | 31684 | 20.172341 |
| 24077 | Mulhouse | 47.75 | 7.33333 | 44 | PPLA3 | 111430 | 27.873514 |
| 45624 | Illzach | 47.78088 | 7.34662 | 44 | PPL | 15457 | 30.001348 |


It only remains to take the $n$ closest cities:

In [76]:
n = 10

# Sort rows by distance 
df = df.sort_values(by='distance', ascending=True)

# Take the n first
df = df[:n]
df

| | asciiname | latitude | longitude | admin1 code | feature code | population | distance |
|---|---|---|---|---|---|---|---|
| 11634 | Saint-Louis | 47.58836 | 7.56247 | 44 | PPL | 20871 | 3.429582 |
| 70253 | Annemasse | 46.19439 | 6.23775 | 84 | PPL | 28275 | 7.138707 |
| 5641 | Thonon-les-Bains | 46.36667 | 6.48333 | 84 | PPLA3 | 31684 | 20.172341 |
| 24077 | Mulhouse | 47.75 | 7.33333 | 44 | PPLA3 | 111430 | 27.873514 |
| 45624 | Illzach | 47.78088 | 7.34662 | 44 | PPL | 15457 | 30.001348 |
| 70255 | Annecy-le-Vieux | 45.91971 | 6.14393 | 84 | PPL | 21521 | 31.413972 |
| 234 | Wittenheim | 47.8078 | 7.33702 | 44 | PPL | 15747 | 32.892333 |
| 70258 | Annecy | 45.9 | 6.11667 | 84 | PPLA2 | 49232 | 33.679869 |
| 55459 | Cran-Gevrier | 45.9 | 6.1 | 84 | PPL | 19354 | 33.789798 |
| 7551 | Seynod | 45.88549 | 6.08831 | 84 | PPL | 18590 | 35.49612 |


Building the actual `Node`s out of it requires the same process as in the Swiss node notebook.
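
Since the same selection is needed for each of the four countries, the two filtering steps could be wrapped in one helper. The sketch below is one possible way to do it, not the project's actual code: `select_closest_cities` is a hypothetical name, it assumes the `distance` column has already been computed, and the toy frame mixes four rows from the table above with one made-up small town.

```python
import pandas as pd


def select_closest_cities(cities, n, pop_threshold):
    """Keep cities above pop_threshold, then return the n with the smallest 'distance'."""
    populated = cities[cities['population'] > pop_threshold]
    # nsmallest returns the rows sorted by increasing distance
    return populated.nsmallest(n, 'distance')


# Toy data: 'Smallville' is a made-up town, close to the border but too small
cities = pd.DataFrame({
    'asciiname': ['Mulhouse', 'Saint-Louis', 'Smallville', 'Annemasse', 'Annecy'],
    'population': [111430, 20871, 900, 28275, 49232],
    'distance': [27.873514, 3.429582, 1.0, 7.138707, 33.679869],
})

selected = select_closest_cities(cities, n=3, pop_threshold=15000)
print(selected['asciiname'].tolist())  # ['Saint-Louis', 'Annemasse', 'Mulhouse']
```

The made-up town is the closest entry but is removed by the population threshold, so the three nearest sufficiently large cities are returned instead.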