# Nodes generation

This notebook describes the process of generating the nodes from the [geonames downloads](http://download.geonames.org/export/dump/) data.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Geonames Data

In [2]:
# Data delimited by tabs, utf-8 encoding
df = pd.read_csv('../data/CH/CH.txt', header=None, encoding='utf8', delimiter='\t', dtype={9: str})
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,2657883,Zuger See,Zuger See,"Lac de Zoug,Lago di Zug,Lai da Zug,Lake Zug,La...",47.1313,8.48335,H,LK,CH,,00,,,,0,413.0,411,Europe/Zurich,2012-02-01
1,2657884,Zwischbergental,Zwischbergental,"Zwischberg-Thal,Zwischbergental",46.16667,8.13333,T,VAL,CH,CH,VS,,,,0,,1671,Europe/Zurich,2012-01-17
2,2657885,Zwischbergen,Zwischbergen,"Zwischbergen,ci wei shi bei gen,茨維施貝根",46.16366,8.11575,P,PPL,CH,,VS,2301.0,6011.0,,127,,1322,Europe/Zurich,2012-01-17
3,2657886,Zwingen,Zwingen,"Cvingen,ci wen gen,Цвинген,茨溫根",47.43825,7.53027,P,PPL,CH,,BL,1302.0,2793.0,,2162,,342,Europe/Zurich,2013-02-28
4,2657887,Zweisimmen,Zweisimmen,"Cvajzimmen,Zweisimmen,Zweisimmeni vald,ci wei ...",46.55452,7.37385,P,PPL,CH,,BE,248.0,794.0,,2813,,944,Europe/Zurich,2013-02-28


In [3]:
# Build the index
index = ['geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude', 'feature class',
         'feature code', 'country code', 'cc2', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code',
         'population', 'elevation', 'dem', 'timezone', 'modification date']

df.columns = index
df.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature class,feature code,country code,cc2,admin1 code,admin2 code,admin3 code,admin4 code,population,elevation,dem,timezone,modification date
0,2657883,Zuger See,Zuger See,"Lac de Zoug,Lago di Zug,Lai da Zug,Lake Zug,La...",47.1313,8.48335,H,LK,CH,,00,,,,0,413.0,411,Europe/Zurich,2012-02-01
1,2657884,Zwischbergental,Zwischbergental,"Zwischberg-Thal,Zwischbergental",46.16667,8.13333,T,VAL,CH,CH,VS,,,,0,,1671,Europe/Zurich,2012-01-17
2,2657885,Zwischbergen,Zwischbergen,"Zwischbergen,ci wei shi bei gen,茨維施貝根",46.16366,8.11575,P,PPL,CH,,VS,2301.0,6011.0,,127,,1322,Europe/Zurich,2012-01-17
3,2657886,Zwingen,Zwingen,"Cvingen,ci wen gen,Цвинген,茨溫根",47.43825,7.53027,P,PPL,CH,,BL,1302.0,2793.0,,2162,,342,Europe/Zurich,2013-02-28
4,2657887,Zweisimmen,Zweisimmen,"Cvajzimmen,Zweisimmen,Zweisimmeni vald,ci wei ...",46.55452,7.37385,P,PPL,CH,,BE,248.0,794.0,,2813,,944,Europe/Zurich,2013-02-28
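
The loading and column-naming steps above can also be combined by passing the column names straight to `pd.read_csv`. The following is a minimal sketch, assuming a small in-memory sample (two rows adapted from the output above, with the alternate names shortened) instead of the real `CH.txt` file; the names `COLUMNS`, `raw_rows` and `sample` are introduced here for illustration only.

```python
import io

import pandas as pd

# Column names as listed in the notebook
COLUMNS = ['geonameid', 'name', 'asciiname', 'alternatenames', 'latitude', 'longitude',
           'feature class', 'feature code', 'country code', 'cc2', 'admin1 code',
           'admin2 code', 'admin3 code', 'admin4 code', 'population', 'elevation',
           'dem', 'timezone', 'modification date']

# Two illustrative rows (assumption: stand-in for the tab-delimited CH.txt file)
raw_rows = [
    ['2657884', 'Zwischbergental', 'Zwischbergental', 'Zwischberg-Thal,Zwischbergental',
     '46.16667', '8.13333', 'T', 'VAL', 'CH', 'CH', 'VS', '', '', '', '0', '', '1671',
     'Europe/Zurich', '2012-01-17'],
    ['2657886', 'Zwingen', 'Zwingen', 'Cvingen,ci wen gen',
     '47.43825', '7.53027', 'P', 'PPL', 'CH', '', 'BL', '1302', '2793', '', '2162', '', '342',
     'Europe/Zurich', '2013-02-28'],
]
buffer = io.StringIO('\n'.join('\t'.join(row) for row in raw_rows))

# names= assigns the headers at load time; dtype can then refer to columns by name
sample = pd.read_csv(buffer, sep='\t', header=None, names=COLUMNS, dtype={'cc2': str})
print(sample[['asciiname', 'feature code', 'population']])
```

With named columns, the `dtype` mapping refers to `'cc2'` rather than to the positional index `9`, which makes the intent easier to read.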


We are interested in the names, coordinates, population and possibly cantons. We also need the feature code column, since it tells us whether a place is a city, a lake, a region, etc. Rows missing this information are useless, so we count the missing values:

In [4]:
print('NaN values : ')
print('\t Latitude : {}.'.format(df['latitude'].isnull().sum()))
print('\t Longitude : {}.'.format(df['longitude'].isnull().sum()))
print('\t Cantons : {}.'.format(df['admin1 code'].isnull().sum()))
print('\t Population : {}.'.format(df['population'].isnull().sum()))
print('\t Feature code : {}.'.format(df['feature code'].isnull().sum()))

NaN values : 
	 Latitude : 0.
	 Longitude : 0.
	 Cantons : 5.
	 Population : 0.
	 Feature code : 0.
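
The same check can be written more compactly by selecting the columns of interest and counting the missing values in one call. This is a minimal sketch on a made-up three-row frame; `sample` and `columns_of_interest` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Tiny illustrative frame (assumption: stand-in for the full geonames dataframe)
sample = pd.DataFrame({
    'latitude': [47.1313, 46.16366, 47.43825],
    'admin1 code': ['00', None, 'BL'],
    'population': [0, 127, 2162],
})

columns_of_interest = ['latitude', 'admin1 code', 'population']

# isnull() marks missing cells, sum() counts them per column
nan_counts = sample[columns_of_interest].isnull().sum()
print(nan_counts)
```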


The data is dense for the features we are interested in, which is good.

However, the population is often 0; these places do not correspond to cities:

In [5]:
# Share of rows whose population is recorded as 0
nullpop = (df['population'] == 0).mean()
print('0 population : {:.2%}'.format(nullpop))

0 population : 82.76%


We drop these rows, keep only the cities, and discard the columns we don't need. See [this page](http://www.geonames.org/export/codes.html) for the feature code reference.

In [6]:
# Drop rows whose population is 0
df = df[df['population'] != 0]

# Keep only cities: feature codes PPL, PPLA (optionally followed by a digit, e.g. PPLA2) or PPLC
df = df[df['feature code'].str.contains(r'^PPL(?:A\d?|C)?$')]

# Keep the columns we need
df = df[['asciiname', 'latitude', 'longitude', 'admin1 code', 'feature code','population']]
print('Shape : {}'.format(df.shape))
df.head()

Shape : (2863, 6)




Unnamed: 0,asciiname,latitude,longitude,admin1 code,feature code,population
2,Zwischbergen,46.16366,8.11575,VS,PPL,127
3,Zwingen,47.43825,7.53027,BL,PPL,2162
4,Zweisimmen,46.55452,7.37385,BE,PPL,2813
6,Zuzwil,47.47452,9.11196,SG,PPL,4226
7,Zuzgen,47.52508,7.89986,AG,PPL,863
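
The feature code filter can be checked in isolation with the standard `re` module. The sketch below assumes a hand-picked list of codes (`codes` is illustrative) and uses `fullmatch` on a non-capturing variant of the pattern, which should accept the same codes as the filter above on these samples.

```python
import re

# PPL, optionally followed by "A" plus an optional digit, or by "C"
CITY_CODE = re.compile(r'PPL(?:A\d?|C)?')

# Hand-picked sample codes (assumption: not an exhaustive list of geonames codes)
codes = ['PPL', 'PPLA', 'PPLA2', 'PPLC', 'PPLX', 'PPLH', 'LK', 'VAL']

# fullmatch requires the whole code to match, so other PPL variants are rejected
kept = [code for code in codes if CITY_CODE.fullmatch(code)]
print(kept)
```

Only `PPL`, `PPLA`, `PPLA2` and `PPLC` pass; codes such as `PPLX` are rejected because the optional group allows nothing but `A`, `A` plus a digit, or `C` after `PPL`.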


## Node selection

We now need to decide which cities to keep as nodes. Since the main flows probably occur between the largest cities, we filter the cities by population: we keep the $n$ most populated cities, where $n$ is a parameter.

In [7]:
n = 10

# Sort rows by population 
df = df.sort_values(by='population', ascending=False)

# Take the n first
df = df[:n]
df

Unnamed: 0,asciiname,latitude,longitude,admin1 code,feature code,population
13,Zurich,47.36667,8.55,ZH,PPLA,341730
2700,Geneve,46.20222,6.14569,GE,PPLA,183981
3638,Basel,47.55839,7.57327,BS,PPLA,164488
3586,Bern,46.94809,7.44744,BE,PPLC,121631
2059,Lausanne,46.516,6.63282,VD,PPLA,116751
85,Winterthur,47.50564,8.72413,ZH,PPLA2,91908
914,Sankt Gallen,47.42391,9.37477,SG,PPLA,70572
1881,Luzern,47.05048,8.30635,LU,PPLA,57066
3547,Biel/Bienne,47.13713,7.24608,BE,PPLA2,48614
482,Thun,46.75118,7.62166,BE,PPLA2,42136
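
The sort-and-slice selection can also be expressed with `DataFrame.nlargest`. Here is a minimal sketch on a few rows taken from the outputs above; `cities` and `top` are illustrative names.

```python
import pandas as pd

# A few cities from the tables above (assumption: stand-in for the filtered dataframe)
cities = pd.DataFrame({
    'asciiname': ['Zurich', 'Thun', 'Geneve', 'Basel', 'Zwingen'],
    'population': [341730, 42136, 183981, 164488, 2162],
})

n = 3

# nlargest returns the n rows with the highest population, sorted in descending order
top = cities.nlargest(n, 'population')
print(top)
```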


We now have our $n$ most populated cities.

## Creating the nodes

We are now ready to create our nodes from the dataframe rows, using the `Node` class from the `node.py` module.

In [8]:
import sys
sys.path.append('../swiss_flows')

from node import Node

In [9]:
# List of nodes
nodes = []

# Iterate over the rows; iterrows() yields (index, row) pairs
for _, row in df.iterrows():
    args = {
        'name': row['asciiname'],
        'position': (row['latitude'], row['longitude']),
        'population': row['population'],
        'canton': row['admin1 code']
    }
    nodes.append(Node(**args))

In [10]:
print(nodes[0])

[Node] Zurich, ZH, (47.36667, 8.55), radius = 10.
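
The `node.py` module is not shown in this notebook. For readers who want to run the cells without it, the following hypothetical stand-in accepts the same keyword arguments and imitates the printed format. Everything in it is an assumption: the class name `SketchNode`, the attribute names, and in particular the default `radius` of 10, which is only guessed from the output above.

```python
class SketchNode:
    """Hypothetical stand-in for the project's Node class (not the real implementation)."""

    def __init__(self, name, position, population, canton, radius=10):
        self.name = name
        self.position = position      # (latitude, longitude) tuple
        self.population = population
        self.canton = canton
        self.radius = radius          # assumed default, guessed from the printed output

    def __str__(self):
        # Imitates the format printed in the notebook
        return '[Node] {}, {}, {}, radius = {}.'.format(
            self.name, self.canton, self.position, self.radius)


node = SketchNode(name='Zurich', position=(47.36667, 8.55),
                  population=341730, canton='ZH')
print(node)
```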


## Saving results

Since we will use the generated nodes throughout the project, it is worth saving this list to avoid regenerating it each time. We use the [pickle](https://docs.python.org/2/library/pickle.html) module:

In [None]:
import pickle

### Saving the list

In [17]:
with open('nodes.pkl', 'wb') as file:
    pickle.dump(nodes, file)

### Loading the list

In [22]:
# Use a dedicated name to avoid shadowing the built-in `list`
with open('nodes.pkl', 'rb') as file:
    loaded_nodes = pickle.load(file)

print(loaded_nodes[0])

[Node] Zurich, ZH, (47.36667, 8.55), radius = 10.
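
The save and load steps can be verified together as a round trip. The sketch below assumes plain dictionaries in place of `Node` objects and writes to a temporary directory so that it does not touch `nodes.pkl`; `records` and `restored` are illustrative names.

```python
import os
import pickle
import tempfile

# Plain dictionaries standing in for Node objects (assumption, for illustration only)
records = [
    {'name': 'Zurich', 'canton': 'ZH', 'position': (47.36667, 8.55), 'population': 341730},
    {'name': 'Geneve', 'canton': 'GE', 'position': (46.20222, 6.14569), 'population': 183981},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'nodes_sketch.pkl')

    # Pickle files must be opened in binary mode
    with open(path, 'wb') as handle:
        pickle.dump(records, handle)

    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# The restored list is an equal but distinct copy of the original
print(restored[0]['name'])
```

When the list holds instances of a custom class such as `Node`, pickle stores a reference to the class rather than its code, so the class must be importable (here, via `from node import Node`) in the session that loads the file.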
