# SC/BEP Data Insights

In [1]:
import pathlib
import pandas as pd
import geopandas as gpd
from geopy.distance import geodesic
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
import warnings
warnings.simplefilter("ignore")

## Foreword

### Disclaimer

 > The data insights provided in this notebook have not been validated and may contain errors. They are shared for feedback only, with no warranty or liability of any kind.



### Sources

 - ULB/Faculty, Faculty mission: anonymized administrative forms (CSV);
 - ULB/DTAS/BEP, Travel Types Carbon Coefficient (Excel);
 - ArcGIS Hub, Country of the World (GeoJSON);
 - OSM, Nominatim Geocoding API (JSON);
 
### Methodology

 - Add an index to mission records;
 - Swap timestamps recorded in reverse order;
 - Cleanse city names (first pass);
 - Split missions with multiple destinations (the origin is assumed to be unique): each mission becomes one or more travels;
 - Expand destinations;
 - Cleanse all city names (second pass);
 - Fill missing origins using Brussels as the default;
 - Manually correct about 100 misspelled cities;
 - Geocode cleansed city names;
 - Add an index to travel records;
 - Build geometries with Coordinate Reference System EPSG:4326;
 - Compute the geodesic distance in km between geometries, using the WGS-84 ellipsoid as reference;
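
The split, expand and fill steps listed above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names (`origin`, `destination`), the index names (`mission_id`, `travel_id`), the `;` separator and the sample rows are all hypothetical and do not come from the actual administrative forms.

```python
import pandas as pd

# Hypothetical mission records: column names and the ";" separator are assumptions.
mission = pd.DataFrame({
    "origin": ["Brussels", None, "Brussels"],
    "destination": ["Paris; Lyon", "Geneva", " Sydney "],
})
mission.index.name = "mission_id"  # index added to mission records

travel = (
    mission
    .assign(
        # Missing origins default to Brussels
        origin=mission["origin"].fillna("Brussels"),
        # One list of destinations per mission
        destination=mission["destination"].str.split(";"),
    )
    .explode("destination")  # one row per destination: a mission becomes travels
    .reset_index()           # keep mission_id as a column, renumber the rows
)
travel["destination"] = travel["destination"].str.strip()  # basic cleansing
travel.index.name = "travel_id"  # index added to travel records
```

Keeping `mission_id` as a column means each travel can still be traced back to the mission it came from, which would help if multi-destination missions are later sequenced rather than treated as distinct travels.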

### To do

 - Fill missing travel types and merge with carbon coefficients;
 - Assess the magnitude of the different errors affecting distance and time, to see where most of the effort should be spent;
 - Sequence travels with multiple destinations instead of treating them as multiple distinct travels;
 - Should the distance be doubled, since the agent must return (and how do we know whether the return is already recorded)?
 - Precisely define how long missions are handled;
 - Precisely define how missions with multiple destinations are handled;
 - Check that long distances around 18000 km are plausible (e.g. Australia: has the Vincenty algorithm converged?);
 - Find a flight distance table, if one is available;
 - How do we fill the travel type when information is missing (distance criterion?);
 - Compute insightful metrics such as distance/elapsed time;
 - Does PHILA stand for PHILO (typo)?
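
For the long-distance check in the list above, one possible sanity test is to compare the geodesic result against an independent great-circle estimate. The sketch below uses the haversine formula on a sphere, which is a different and coarser technique than the WGS-84 geodesic used in this notebook, so agreement is only expected to within a fraction of a percent. The function name `haversine_km`, the mean radius value and the approximate coordinates are assumptions made for illustration. As far as the geopy documentation describes it, `geodesic` relies on Karney's algorithm rather than Vincenty's, so non-convergence may not be a concern, but that is worth confirming against the installed version.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius: spherical approximation, not WGS-84


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding error near antipodal points
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates, for illustration only
brussels = (50.85, 4.35)
sydney = (-33.87, 151.21)

check_km = haversine_km(*brussels, *sydney)

# No great-circle distance can exceed half the circumference (about 20000 km)
half_circumference_km = math.pi * EARTH_RADIUS_KM
```

A travel whose geodesic distance differs from `check_km` by more than about one percent, or exceeds `half_circumference_km`, would be a candidate for manual review (for instance, a geocoding error rather than a distance error).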
 

## Load datasets

In [3]:
# Pass the path directly: read_file opens and closes the file itself
country = gpd.read_file(pathlib.Path("data/country.geojson"))

In [4]:
travel = pd.read_excel('data/travels.xlsx')