# List 5 - Around the world in 80 days

In this list we had to find all the cities visited by the main character, Phileas Fogg, in the book "Around the World in 80 Days". Because many cities are mentioned in the novel, we had to find a way to extract only those where he actually stopped.

Journey:
London – Paris – Turin – Brindisi – Suez – Aden – Bombay – Allahabad – Calcutta – Singapore – Hong Kong – Shanghai – Yokohama – San Francisco – Salt Lake City – Medicine Bow – Fort Kearney – Omaha – Chicago – New York City – Queenstown – Dublin – Liverpool – London

# Getting data

In [38]:
import re
import time
import wikipedia
import warnings
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import plotly.express as px
from itertools import chain
from geotext import GeoText
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
warnings.filterwarnings('ignore')

To identify the right cities, we create a dictionary of keywords with weights.

In [10]:
with open('book.txt', 'r') as f:
    book = f.readlines()[78:]

keywords = {"arriv": 2, "come": 0, "came": 0, "by boat": 0, "train": 3, "by ship": 0, "by steamer": 0,
            "by rail": 0, "reach": 5, "Reach": 5, "Left": 5, "enter": 3, "left": 0,
            "get to": 0, "stay": 5, "exit": 0, "return": 0, "back to": 0, "journey": 5,
            "start": 0, "crossing": 0, "passing": 4, "crosses": 7, "stop": 4}


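The function "get_context" takes the index of a line in the book and returns that line together with the two lines before it. For the first line it returns only that line, and for line indices from 8229 onward it returns only the two preceding lines.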

In [11]:
def get_context(line_index):
    if line_index == 0:
        return book[line_index:line_index+1]

    if line_index >= 8229:
        return book[line_index-2:line_index]

    return book[line_index-2:line_index+1]
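
The same window can be computed without special cases. The following is a minimal sketch, assuming the intended window is the two preceding lines plus the current line; `context_window` and its `before` parameter are hypothetical names that do not appear in the notebook.

```python
def context_window(lines, index, before=2):
    """Return lines[index-before .. index], clamped at the start of the list."""
    start = max(index - before, 0)  # avoid a negative slice start wrapping around
    return lines[start:index + 1]


sample = ["line 0\n", "line 1\n", "line 2\n", "line 3\n"]
print(context_window(sample, 0))  # only the first line
print(context_window(sample, 1))  # lines 0 and 1
print(context_window(sample, 3))  # lines 1, 2 and 3
```

Clamping the start matters for index 1: the slice `book[-1:2]` used by "get_context" in that case is empty for any book longer than two lines.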

Because the book dates from 1872, some cities have since changed their name or status. To handle this, we use two Wikipedia articles: "List of city name changes" and "Independent city".

https://en.wikipedia.org/wiki/List_of_city_name_changes
https://en.wikipedia.org/wiki/Independent_city

In [21]:
def _get_changed_cities():
    changed_cities = wikipedia.page("List_of_city_name_changes")
    soup = BeautifulSoup(changed_cities.html(), "html.parser")
    changed_cities_soup, change = [c for c in soup.find_all("ul")], []
    for i in changed_cities_soup:
        for j in i.find_all("li"):
            change.append(j.get_text())

    cities = _get_correct_site_part(change)
    return _get_cities_to_change(cities)


def _get_cities_to_change(cities):
    to_change = {}
    for city in cities:
        to_change[tuple([_clean_text(city[j].strip()) for j in range(len(city) - 1)])] = [_clean_text(city[-1].strip())]
    return to_change


def _get_correct_site_part(change):
    index = change.index('Alexandria Ariana → Herat')
    index2 = change.index('List of administrative division name changes')
    change = change[index:index2]
    return [pair.split('→') for pair in change]


def _clean_text(text):
    pattern = r"\(\d+\)|\[\d+\]"  # footnote markers such as (3) or [12]
    text = re.sub(pattern, '', text)
    return text.strip()
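
To illustrate what these helpers produce, here is a minimal, self-contained sketch that parses a few hand-written items in the same "old → new" form. The sample strings and the names `strip_footnotes` and `parse_change` are made up for illustration; the real input comes from the Wikipedia page.

```python
import re

# Footnote markers such as "[12]" or "(3)" that may follow a city name.
FOOTNOTE = re.compile(r"\(\d+\)|\[\d+\]")


def strip_footnotes(text):
    """Remove numeric footnote markers and surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


def parse_change(item):
    """Split 'Old → ... → New' into (tuple_of_old_names, [current_name])."""
    names = [strip_footnotes(part) for part in item.split("→")]
    return tuple(names[:-1]), [names[-1]]


# Hypothetical sample items, not copied from the article.
items = ["Bombay → Mumbai[1]", "Calcutta (2) → Kolkata"]
changes = dict(parse_change(item) for item in items)
print(changes)
```

The resulting dictionary has the same shape as the one returned by "_get_cities_to_change": a tuple of former names as the key and a one-element list with the current name as the value.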

In [22]:
def _replace_city_names(line, changes):
    for pair, val in changes.items():
        for i in range(len(pair)):
            if pair[i] and pair[i] in line and pair[i] != "York" and pair[i] != "Lake City":
                if GeoText(pair[i]).cities:
                    if GeoText(pair[i]).cities[0] in line:
                        return line
                line = line.replace(pair[i], val[0])
    return line

def _get_city_states():
    city_states = wikipedia.page("City-state").content
    return [i for i in GeoText(city_states).countries if not any([i + ":" in city_states, i in ['Greece', 'France',
                                                                                                'Portugal', 'Mongolia',
                                                                                                'Japan', 'Italy']])]

The function "look_for_place" finds all cities in a given line. To identify cities in the text, we use the "geotext" library.

In addition, we check the line word by word, because "geotext" misses a city in some cases, for example when the city is the second word of a sentence, as the example below shows.

In [23]:
GeoText("Our Dublin is a great city").cities

[]

In [24]:
GeoText("Dublin is a great city").cities

['Dublin']
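
The difference between the two calls above is what the word-by-word check compensates for. The toy matcher below is a sketch, not "geotext" itself: it assumes that the miss happens because a run of capitalized words ("Our Dublin") is looked up as a single candidate, and it uses a three-entry stand-in gazetteer. `GAZETTEER`, `find_cities` and `find_cities_with_fallback` are hypothetical names.

```python
import re

GAZETTEER = {"Dublin", "London", "New York"}  # tiny stand-in for a real city list
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")  # runs of capitalized words


def find_cities(text):
    """Look up each run of capitalized words as one candidate name."""
    return [c for c in CANDIDATE.findall(text) if c in GAZETTEER]


def find_cities_with_fallback(text):
    """Add a word-by-word pass for names hidden inside a longer candidate."""
    found = find_cities(text)
    already = " ".join(found)
    extra = [w for w in text.split(" ") if w in GAZETTEER and w not in already]
    return found + extra


print(find_cities("Our Dublin is a great city"))
print(find_cities_with_fallback("Our Dublin is a great city"))
print(find_cities_with_fallback("He left New York for London"))
```

The fallback only adds words that are not already part of a found name, so multi-word names such as "New York" are not split or counted twice.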

In [25]:
def look_for_place(line, lineindex):
    line = _replace_city_names(line, changed_cities)
    places = GeoText(line)
    extra_places = list(chain(*[GeoText(word).cities for word in str(line).split(" ") if word not in (" ").join(places.cities)]))
    city_states = [i for i in GeoText(line).countries if i in city_states_list]
    context = get_context(lineindex)
    for sentence in context:
        found_places = _look_for_key_in_sentence(sentence, places, extra_places, city_states)
        if found_places:
            return found_places

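The function "_look_for_key_in_sentence" checks whether a sentence contains any of the keywords. On the first match it returns the places found in the line, repeated as many times as the keyword's weight (once if the weight is 0).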

In [26]:
def _look_for_key_in_sentence(sentence, places, extra_places, city_states):
    for key in keywords.keys():
        if key in sentence:
            if keywords[key]:
                return list(chain(*[list(set(places.cities + extra_places + city_states))
                                    for i in range(keywords[key])]))
            return list(set(places.cities + extra_places + city_states))
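
The effect of the weights can be seen on a small example. This is a simplified sketch of the same idea with a made-up weight table and made-up sentences; `WEIGHTS` and `weighted_votes` are hypothetical names. As in the function above, the first matching keyword decides, a positive weight repeats the places that many times, and a weight of 0 still returns each place once.

```python
WEIGHTS = {"arriv": 2, "reach": 5, "left": 0}  # made-up subset of weights


def weighted_votes(sentence, places):
    """Return the places repeated according to the first keyword found."""
    for key, weight in WEIGHTS.items():
        if key in sentence:
            return sorted(set(places)) * max(weight, 1)
    return []  # no travel keyword (the notebook function returns None here)


print(weighted_votes("They arrived at Suez.", ["Suez"]))
print(weighted_votes("He had left London.", ["London"]))
print(weighted_votes("A map of Paris lay there.", ["Paris"]))
```

Each repetition later counts as one point for the city, so a line matched by a high-weight keyword contributes more to the final score.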

Using the functions described above, we search the whole book for the cities on the route. To keep only the reliable ones, we return the cities that scored more than 9 points.

In [27]:
def search_for_places(book):
    cities = {}
    for index, line in enumerate(book):
        places = look_for_place(line, index)
        if places:
            cities = _add_places_to_cities(cities, places)
    return {key: value for key, value in cities.items() if value > 9}


def _add_places_to_cities(cities, places):
    for i in range(len(places)):
        if places[i]:
            if places[i] in cities.keys():
                cities[places[i]] += 1
            else:
                cities[places[i]] = 1
    return cities
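
The counting and thresholding step can also be written with `collections.Counter`. This is a minimal sketch, assuming the same rule as above (keep cities with more than 9 points); the vote lists are invented and `tally` is a hypothetical helper.

```python
from collections import Counter


def tally(vote_lists, threshold=9):
    """Sum the votes from every line and keep cities scoring above threshold."""
    scores = Counter()
    for votes in vote_lists:
        scores.update(votes)  # each occurrence of a city name adds one point
    return {city: n for city, n in scores.items() if n > threshold}


# Invented votes: one list per line of text.
votes_per_line = [["Suez"] * 5, ["Suez"] * 5, ["Paris"] * 2, ["Suez", "Paris"]]
print(tally(votes_per_line))
```

Here "Suez" collects 11 points and is kept, while "Paris" collects 3 and is dropped.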

In [36]:
changed_cities = _get_changed_cities()
city_states_list = _get_city_states()

In [40]:
travel_path = search_for_places(book)
travel_path

{'London': 76,
 'Brindisi': 15,
 'Suez': 43,
 'Mumbai': 70,
 'Kolkata': 53,
 'Hong Kong': 59,
 'Yokohama': 57,
 'San Francisco': 37,
 'New York': 46,
 'York': 20,
 'Paris': 17,
 'Turin': 10,
 'Singapore': 14,
 'Aden': 10,
 'Shanghai': 12,
 'Liverpool': 31,
 'Sioux': 21,
 'Omaha': 44,
 'Salt Lake': 10,
 'Salt': 18,
 'Kearney': 20,
 'Chicago': 15,
 'Queenstown': 11,
 'Dublin': 11}

# Map plot

To get the coordinates of our cities, we use the "geopy.geocoders" module and Szwabin's function.

In [43]:
geolocator = Nominatim(user_agent="AroundTheWorld")
positions = dict()
for c in travel_path:
    while True:
        try:
            position = geolocator.geocode(c)
        except Exception:  # retry on geocoder errors without trapping Ctrl+C
            time.sleep(5)
            continue
        break
    if position:
        location = [position.latitude, position.longitude]
        positions.update({c: location})
        del position
    else:
        print("Could not get position for {}".format(c))
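
The loop above retries forever if the geocoder keeps failing. A bounded variant could look like the sketch below. To keep it runnable without network access, it takes the geocoding function as a parameter and is exercised with a fake one; with geopy, `geolocator.geocode` would be passed instead. `geocode_with_retry`, `attempts`, `wait` and `flaky_geocode` are hypothetical names.

```python
import time


def geocode_with_retry(geocode, name, attempts=3, wait=0.0):
    """Call geocode(name), retrying on errors up to `attempts` times."""
    for _ in range(attempts):
        try:
            return geocode(name)
        except Exception:  # e.g. a timeout raised by the geocoding service
            time.sleep(wait)
    return None  # give up instead of looping forever


calls = {"n": 0}


def flaky_geocode(name):
    """Fake geocoder that fails once and then returns made-up coordinates."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return (1.0, 2.0)


print(geocode_with_retry(flaky_geocode, "Suez"))
```

Returning `None` after the last attempt fits the existing `if position:` check, so a city that cannot be geocoded would be reported rather than blocking the loop.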

We convert the resulting dictionary to a pandas DataFrame.

In [44]:
df = pd.DataFrame.from_dict(positions, orient='index')
df = df.rename({0: 'Latitude', 1: 'Longitude'}, axis=1)
df = df.reset_index()
df

| | index | Latitude | Longitude |
|---|---|---|---|
| 0 | London | 51.507322 | -0.127647 |
| 1 | Brindisi | 40.63592 | 17.688443 |
| 2 | Suez | 29.974498 | 32.537086 |
| 3 | Mumbai | 19.07599 | 72.877393 |
| 4 | Kolkata | 22.541418 | 88.357691 |
| 5 | Hong Kong | 22.279328 | 114.162813 |
| 6 | Yokohama | 35.444991 | 139.636768 |
| 7 | San Francisco | 37.779026 | -122.419906 |
| 8 | New York | 40.712728 | -74.006015 |
| 9 | York | 53.959055 | -1.081536 |


In [46]:
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
gdf

| | index | Latitude | Longitude | geometry |
|---|---|---|---|---|
| 0 | London | 51.507322 | -0.127647 | POINT (-0.12765 51.50732) |
| 1 | Brindisi | 40.63592 | 17.688443 | POINT (17.68844 40.63592) |
| 2 | Suez | 29.974498 | 32.537086 | POINT (32.53709 29.97450) |
| 3 | Mumbai | 19.07599 | 72.877393 | POINT (72.87739 19.07599) |
| 4 | Kolkata | 22.541418 | 88.357691 | POINT (88.35769 22.54142) |
| 5 | Hong Kong | 22.279328 | 114.162813 | POINT (114.16281 22.27933) |
| 6 | Yokohama | 35.444991 | 139.636768 | POINT (139.63677 35.44499) |
| 7 | San Francisco | 37.779026 | -122.419906 | POINT (-122.41991 37.77903) |
| 8 | New York | 40.712728 | -74.006015 | POINT (-74.00602 40.71273) |
| 9 | York | 53.959055 | -1.081536 | POINT (-1.08154 53.95906) |


The visited cities are shown on the maps below.

In [83]:
geo_df = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))

fig = px.line_geo(gdf, lat=gdf.geometry.y, lon=gdf.geometry.x, hover_name='index', projection='natural earth', color_discrete_sequence=["orangered"])
fig.update_geos(
    resolution=50,
    showcoastlines=True, coastlinecolor="mediumaquamarine",
    showland=True, landcolor="Green",
    showocean=True, oceancolor="LightBlue",
    showlakes=True, lakecolor="blue",
    showrivers=True, rivercolor="lightblue"
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()

In [92]:
fig = px.scatter_mapbox(gdf, lat=gdf.geometry.y, lon=gdf.geometry.x, hover_name="index", zoom=1, color_discrete_sequence=["red"])
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## Summary

1. The correct path from the book consists of 23 cities. Our code found 24: we are missing 2 (Allahabad and Medicine Bow) and we have 3 extra entries (York, Salt, Sioux).
2. Due to the age of the book, we had to deal with 2 problems: one related to changes in city names and the other to administrative changes.
3. The "geonamescache" library is also popular for looking up cities and their alternative names, but it cannot extract them from a whole sentence.
4. The two cities that we did not find were missed because the "geotext" library does not recognize them.
5. Our scoring system is quite accurate: it recovers 21 of the 23 cities on the route (about 91%), with 3 false positives.
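
The extra entries "York" and "Salt" from item 1 are fragments of longer names that were also found ("New York", "Salt Lake"). One possible follow-up, which is not part of the notebook, is to drop any name that occurs as a whole word inside another found name. Below is a sketch with a hypothetical helper `drop_fragments`, run on a few of the scores reported above.

```python
def drop_fragments(scores):
    """Drop names that occur as whole words inside a longer found name."""
    return {
        name: score
        for name, score in scores.items()
        if not any(other != name and f" {name} " in f" {other} " for other in scores)
    }


found = {"New York": 46, "York": 20, "Salt Lake": 10, "Salt": 18, "Sioux": 21}
print(drop_fragments(found))
```

This removes "York" and "Salt" but keeps "Sioux", which is not a fragment of another name and would need a different filter. It would also remove a real stop in York if the route contained one, so it is only suitable when fragments are known to be false positives.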