# Looking for patterns in city names

Recently, I was travelling around New Zealand and noticed that the Maori language often uses the same letter back to back, as in the original Maori name for Stratford ("Whakaahurangi").

So, as any normal person does, I wondered which town has the most repeated letters, and the idea for this blog post was born.

First, we need a dataset of town names. I found one on Kaggle (https://www.kaggle.com/max-mind/world-cities-database), a database of world cities.


In [15]:
import pandas as pd
import collections
from collections import OrderedDict
import operator
# import matplotlib.pyplot as plt
import numpy as np
import math
from bokeh.plotting import figure, show, output_notebook,output_file, save
from bokeh.tile_providers import CARTODBPOSITRON
from bokeh.models import ColumnDataSource, HoverTool
from pyproj import Proj, transform
from bokeh.resources import CDN
from bokeh.embed import file_html
# output_notebook()



## Get the data!

In [16]:
# data source https://www.kaggle.com/max-mind/world-cities-database
cities_df = pd.read_csv('content/notebooks/looking-for-patterns-in-city-names-interactive-plotting/data/worldcitiespop.csv', header=0, sep=',', quotechar='"')
cities_df = cities_df[cities_df['Country'] == "nz"]
display(cities_df.head(5))

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
2047989,nz,abbotsford,Abbotsford,G2,,-45.883333,170.416667
2047990,nz,adams flat,Adams Flat,G3,,-46.116667,169.833333
2047991,nz,addington,Addington,E9,,-43.55,172.616667
2047992,nz,admiralty bay,Admiralty Bay,F7,,-40.95,173.916667
2047993,nz,ahaura,Ahaura,G3,,-42.35,171.533333


After inspecting the dataset, we can filter it down to New Zealand by keeping the rows where the Country column is "nz". Note that this dataset holds the current names of the towns, not the original Maori names; more on this in a later post. Next, we extract the town names we want to analyse from the dataframe. For ease later on, we store them as the keys of a dictionary, so that each town's value can later hold the count of each letter.

In [17]:
nz_cities = cities_df[cities_df['Country'] == "nz"]['AccentCity'].tolist()
nz_dict = { i : 0 for i in nz_cities }
display(nz_dict)

{'Abbotsford': 0,
 'Adams Flat': 0,
 'Addington': 0,
 'Admiralty Bay': 0,
 'Ahaura': 0,
 'Ahikiwi': 0,
 'Ahimia': 0,
 'Ahipara': 0,
 'Ahititi': 0,
 'Ahuiti': 0,
 'Ahurangi': 0,
 'Ahuriri': 0,
 'Ahuriri Village': 0,
 'Ahuroa': 0,
 'Aickens': 0,
 'Aka Aka': 0,
 'Akatore': 0,
 'Akerama': 0,
 'Akitio': 0,
 'Albany': 0,
 'Albert Road': 0,
 'Albert Town': 0,
 'Albury': 0,
 'Alford Forest': 0,
 'Alfredton': 0,
 'Algies Bay': 0,
 'Allandale': 0,
 'Allanton': 0,
 'Allerton': 0,
 'Alma': 0,
 'Altimarloch': 0,
 'Altimarlock': 0,
 'Alton': 0,
 'Amberley': 0,
 'Amodeo': 0,
 'Amodeo Bay': 0,
 'Anama': 0,
 'Annat': 0,
 'Aohanga': 0,
 'Aokautere': 0,
 'Aongatete': 0,
 'Aorere': 0,
 'Aotea': 0,
 'Aotuhia': 0,
 'Aparima': 0,
 'Apata': 0,
 'Apiti': 0,
 'Appleby': 0,
 'Arahura': 0,
 'Arai Point': 0,
 'Aramiro': 0,
 'Aramoana': 0,
 'Aramoho': 0,
 'Aranga': 0,
 'Aranui': 0,
 'Arapae': 0,
 'Arapito': 0,
 'Arapohue': 0,
 'Arapuni': 0,
 'Ararata': 0,
 'Ararimu': 0,
 'Aratapu': 0,
 'Arcadia': 0,
 'Ardlussa': 0,

Next, with help from the collections package, we create a template dictionary that maps every letter of the alphabet to a count of zero. A dictionary like this will store the count of each letter in a town name.

In [18]:
letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

lcount = dict(OrderedDict([(l, 0) for l in letters]))
display(lcount)

{'A': 0,
 'B': 0,
 'C': 0,
 'D': 0,
 'E': 0,
 'F': 0,
 'G': 0,
 'H': 0,
 'I': 0,
 'J': 0,
 'K': 0,
 'L': 0,
 'M': 0,
 'N': 0,
 'O': 0,
 'P': 0,
 'Q': 0,
 'R': 0,
 'S': 0,
 'T': 0,
 'U': 0,
 'V': 0,
 'W': 0,
 'X': 0,
 'Y': 0,
 'Z': 0}

Now it's time for the data crunch. To count how many times each letter appears in a town name, we follow these steps:

- loop through all the city names in the list,
- initialise a letter dictionary like the one above as the value of that town's dictionary entry,
- loop through each character in the town name,
- check whether the uppercased character appears in our letters (mainly to avoid counting spaces),
- if it does, increment the value for that letter by 1.

This results in a dictionary for each town name, holding the count of each letter.

In [19]:
for city in nz_cities:
    nz_dict[city] = dict(OrderedDict([(l, 0) for l in letters]))
    city_dict = nz_dict[city]
    for c in city:
        if c.upper() in letters:
            city_dict[c.upper()] += 1

display(nz_dict) 

{'Abbotsford': {'A': 1,
  'B': 2,
  'C': 0,
  'D': 1,
  'E': 0,
  'F': 1,
  'G': 0,
  'H': 0,
  'I': 0,
  'J': 0,
  'K': 0,
  'L': 0,
  'M': 0,
  'N': 0,
  'O': 2,
  'P': 0,
  'Q': 0,
  'R': 1,
  'S': 1,
  'T': 1,
  'U': 0,
  'V': 0,
  'W': 0,
  'X': 0,
  'Y': 0,
  'Z': 0},
 'Adams Flat': {'A': 3,
  'B': 0,
  'C': 0,
  'D': 1,
  'E': 0,
  'F': 1,
  'G': 0,
  'H': 0,
  'I': 0,
  'J': 0,
  'K': 0,
  'L': 1,
  'M': 1,
  'N': 0,
  'O': 0,
  'P': 0,
  'Q': 0,
  'R': 0,
  'S': 1,
  'T': 1,
  'U': 0,
  'V': 0,
  'W': 0,
  'X': 0,
  'Y': 0,
  'Z': 0},
 'Addington': {'A': 1,
  'B': 0,
  'C': 0,
  'D': 2,
  'E': 0,
  'F': 0,
  'G': 1,
  'H': 0,
  'I': 1,
  'J': 0,
  'K': 0,
  'L': 0,
  'M': 0,
  'N': 2,
  'O': 1,
  'P': 0,
  'Q': 0,
  'R': 0,
  'S': 0,
  'T': 1,
  'U': 0,
  'V': 0,
  'W': 0,
  'X': 0,
  'Y': 0,
  'Z': 0},
 'Admiralty Bay': {'A': 3,
  'B': 1,
  'C': 0,
  'D': 1,
  'E': 0,
  'F': 0,
  'G': 0,
  'H': 0,
  'I': 1,
  'J': 0,
  'K': 0,
  'L': 1,
  'M': 1,
  'N': 0,
  'O': 0,
  'P': 0,
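
As an aside, the counting steps above can also be sketched with `collections.Counter` instead of a pre-filled dictionary. This is only a minimal sketch, not the code used in this post: the helper name `count_letters` and the `sample_names` list are made up for illustration, and it assumes we only care about the letters A to Z.

```python
from collections import Counter
import string


def count_letters(name):
    """Count how often each letter A-Z appears in `name`, ignoring case."""
    # Keep only ASCII letters so spaces and punctuation are skipped.
    return Counter(ch for ch in name.upper() if ch in string.ascii_uppercase)


# Two town names taken from the dataset above.
sample_names = ["Abbotsford", "Adams Flat"]
letter_counts = {name: count_letters(name) for name in sample_names}

print(letter_counts["Abbotsford"]["B"])  # 2
print(letter_counts["Adams Flat"]["A"])  # 3
print(letter_counts["Adams Flat"]["Z"])  # 0 (a Counter returns 0 for missing keys)
```

One trade-off: a `Counter` only stores letters that actually occur, so a dataframe built from it would need its missing values filled with 0, whereas the pre-filled dictionary in this post already has every letter.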

Hooray! Now we have all the data we need, broken down and ready for analysis. To ease the analysis and make it more readable for a human, we convert our nested dictionaries to a pandas dataframe and transpose it, so that the town names form the index, the letters form the columns and the counts of each letter are the values.

In [20]:
total_df = pd.DataFrame.from_dict(nz_dict)
total_df = total_df.T
display(total_df)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,...,Q,R,S,T,U,V,W,X,Y,Z
Abbotsford,1,2,0,1,0,1,0,0,0,0,...,0,1,1,1,0,0,0,0,0,0
Adams Flat,3,0,0,1,0,1,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
Addington,1,0,0,2,0,0,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
Admiralty Bay,3,1,0,1,0,0,0,0,1,0,...,0,1,0,1,0,0,0,0,2,0
Ahaura,3,0,0,0,0,0,0,1,0,0,...,0,1,0,0,1,0,0,0,0,0
Ahikiwi,1,0,0,0,0,0,0,1,3,0,...,0,0,0,0,0,0,1,0,0,0
Ahimia,2,0,0,0,0,0,0,1,2,0,...,0,0,0,0,0,0,0,0,0,0
Ahipara,3,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,0
Ahititi,1,0,0,0,0,0,0,1,3,0,...,0,0,0,2,0,0,0,0,0,0
Ahuiti,1,0,0,0,0,0,0,1,2,0,...,0,0,0,1,1,0,0,0,0,0
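
For what it's worth, `pd.DataFrame.from_dict` also takes an `orient="index"` argument, which should give the same shape without the separate transpose. Below is a small sketch on toy data; `toy_counts` and `toy_df` are made-up names, and the toy dictionaries deliberately leave out letters with a zero count to show how missing values can be handled.

```python
import pandas as pd

# Toy stand-in for nz_dict: town name -> {letter: count}.
toy_counts = {
    "Ahaura": {"A": 3, "H": 1, "R": 1, "U": 1},
    "Ahititi": {"A": 1, "H": 1, "I": 3, "T": 2},
}

# orient="index" makes the outer keys (town names) the row labels directly.
toy_df = pd.DataFrame.from_dict(toy_counts, orient="index")

# Letters missing from a town show up as NaN, so fill them with 0.
toy_df = toy_df.fillna(0).astype(int)
print(toy_df)
```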


Now we want to find which name has the maximum count for each letter and store it in a summary dataframe. We could probably use the pivot function with aggregate types, but I have not figured out a nice way to do this yet. If you know a nicer way to determine this, please let me know.

In [21]:
summary_df = pd.DataFrame()
scale = 1
summary_df['City_Name'] = total_df.idxmax()
summary_df['Count'] = total_df.loc[total_df.idxmax()].max()
display(summary_df)

Unnamed: 0,City_Name,Count
A,Kaingapai Hakataramea Station,9
B,Abbotsford,2
C,Christchurch,3
D,Edendale Town District,3
E,Earnscleugh Settlement,5
F,Flagstaff,3
G,Kyeburn Diggings,3
H,Christchurch,3
I,Kihikihi Town District,6
J,Clarks Junction,1
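
One possibly simpler way to get the same summary is to pair `idxmax()` with a plain column-wise `max()`, since both return a Series indexed by letter. The sketch below uses a tiny made-up dataframe (`toy_df`, `toy_summary`) rather than the real data, so treat it as an illustration of the idea only.

```python
import pandas as pd

# Toy letter-count table: rows are towns, columns are letters.
toy_df = pd.DataFrame(
    {"A": [3, 1, 1], "B": [0, 2, 0], "I": [0, 0, 3]},
    index=["Ahaura", "Abbotsford", "Ahititi"],
)

toy_summary = pd.DataFrame({
    "City_Name": toy_df.idxmax(),  # row label of the first maximum per column
    "Count": toy_df.max(),         # the maximum value per column
})
print(toy_summary)
```

Note that `idxmax()` returns the first row holding the maximum, so when several towns tie for a letter only one of them is reported.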


Next, we use the pandas equivalent of an index-match in Excel, which you can read more about here (https://towardsdatascience.com/name-your-favorite-excel-function-and-ill-teach-you-its-pandas-equivalent-7ee4400ada9f). Admittedly, we could've made the join earlier, but since I use index-match so often in Excel, I wanted to learn how to do the same in pandas. This is achieved with the map function (the equivalent of the index): by passing it a lookup keyed on the index of another dataframe (the match function), we can rejoin the dataset by matching the city name against our original dataset.

In [22]:
summary_df['Latitude'] = summary_df['City_Name'].map(cities_df.set_index(['AccentCity'])['Latitude'].to_dict()) * scale
summary_df['Longitude'] = summary_df['City_Name'].map(cities_df.set_index(['AccentCity'])['Longitude'].to_dict()) * scale
display(summary_df)

Unnamed: 0,City_Name,Count,Latitude,Longitude
A,Kaingapai Hakataramea Station,9,-44.6,170.566667
B,Abbotsford,2,-45.883333,170.416667
C,Christchurch,3,-43.533333,172.633333
D,Edendale Town District,3,-46.316667,168.783333
E,Earnscleugh Settlement,5,-45.216667,169.316667
F,Flagstaff,3,-45.833333,170.483333
G,Kyeburn Diggings,3,-45.0,170.283333
H,Christchurch,3,-43.533333,172.633333
I,Kihikihi Town District,6,-38.033333,175.35
J,Clarks Junction,1,-45.733333,170.05
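
To make the index-match idea a little more concrete, here is a stripped-down sketch of the same lookup on toy data. The names `toy_cities`, `toy_summary` and `lat_lookup` are made up for illustration, and the sketch assumes every city name in the lookup table is unique.

```python
import pandas as pd

# Toy stand-in for cities_df.
toy_cities = pd.DataFrame({
    "AccentCity": ["Abbotsford", "Christchurch", "Flagstaff"],
    "Latitude": [-45.883333, -43.533333, -45.833333],
})

# Toy stand-in for summary_df, indexed by letter.
toy_summary = pd.DataFrame(
    {"City_Name": ["Christchurch", "Abbotsford", "Christchurch"]},
    index=["C", "B", "H"],
)

# Build the lookup once: index = key to match on, values = column to fetch.
lat_lookup = toy_cities.set_index("AccentCity")["Latitude"]

# map() looks up each City_Name in the lookup; unmatched names become NaN.
toy_summary["Latitude"] = toy_summary["City_Name"].map(lat_lookup)
print(toy_summary)
```

If the same name appears more than once in the lookup table, converting it with `to_dict()` as done above keeps only the last matching row, so it is worth checking for duplicates first.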


Now we have a dataframe that contains:
- an index of the letters,
- the town name with the most repeats of each letter,
- the count of that letter within the name,
- the longitude and latitude of the town.

To plot with Bokeh on a basemap, we need to convert longitude and latitude to easting and northing in Web Mercator coordinates. The pyproj package makes this very simple.

In [23]:
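def LongLat_to_EN(long, lat):
    """Convert WGS 84 longitude/latitude (EPSG:4326) to Web Mercator (EPSG:3857)."""
    try:
        easting, northing = transform(
            Proj(init='epsg:4326'), Proj(init='epsg:3857'), long, lat)
        return easting, northing
    except Exception:
        # Return empty coordinates rather than failing on a bad row.
        return None, None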

This function can be used to generate the easting and northing for every town from its longitude and latitude and add them to the dataframe.

In [24]:
summary_df['E'], summary_df['N'] = zip(*summary_df.apply(lambda x: LongLat_to_EN(x['Longitude'], x['Latitude']), axis=1))

display(summary_df)

Unnamed: 0,City_Name,Count,Latitude,Longitude,E,N
A,Kaingapai Hakataramea Station,9,-44.6,170.566667,18987390.0,-5558768.0
B,Abbotsford,2,-45.883333,170.416667,18970700.0,-5761673.0
C,Christchurch,3,-43.533333,172.633333,19217450.0,-5393506.0
D,Edendale Town District,3,-46.316667,168.783333,18788870.0,-5831241.0
E,Earnscleugh Settlement,5,-45.216667,169.316667,18848250.0,-5655696.0
F,Flagstaff,3,-45.833333,170.483333,18978120.0,-5753681.0
G,Kyeburn Diggings,3,-45.0,170.283333,18955850.0,-5621521.0
H,Christchurch,3,-43.533333,172.633333,19217450.0,-5393506.0
I,Kihikihi Town District,6,-38.033333,175.35,19519870.0,-4584136.0
J,Clarks Junction,1,-45.733333,170.05,18929880.0,-5737718.0
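
For the curious, the conversion itself is the standard spherical Web Mercator formula, which can be sketched in plain Python without pyproj. This is not what the notebook runs; the function name `lonlat_to_web_mercator` and the constant `EARTH_RADIUS_M` are my own labels, and the sketch assumes coordinates in degrees well away from the poles.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by Web Mercator


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert WGS 84 degrees to EPSG:3857 easting/northing in metres."""
    # Easting is simply the longitude in radians scaled by the radius.
    easting = EARTH_RADIUS_M * math.radians(lon_deg)
    # Northing stretches latitude with the Mercator log-tan term.
    northing = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return easting, northing


# Christchurch, using the coordinates from the table above.
easting, northing = lonlat_to_web_mercator(172.633333, -43.533333)
print(round(easting), round(northing))
```

The result should land close to the E and N values shown for Christchurch in the table, which is a handy sanity check on the pyproj output.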


Finally, it's time to plot our findings on a map. Before we initialise the map in Bokeh, we need to put our data into ColumnDataSource form, which Bokeh uses for most plots, data tables and more. We also set up the hover tool, which provides the interactivity when the user hovers over the data points on the plot.

In [25]:
source = ColumnDataSource(data=dict(
                        longitude=list(summary_df['E']), 
                        latitude=list(summary_df['N']),
                        sizes=list(summary_df['Count']*3),
                        lettercount = list(summary_df['Count']),
                        city_name=list(summary_df['City_Name']),
                        letters = list(summary_df.index)))

hover = HoverTool(tooltips=[
    ("Repeated Letter" , "@letters"),
    ("City Name", "@city_name"),
    ("Count","@lettercount")
    
])

Finally, time for the plot! Admittedly, I haven't found an easy way to work out the axis limits of the graph, so these were found with a lot of trial and error (if you know a better way, please let me know!).

In [26]:
p = figure(x_range=(20000000,17900000), y_range=(-6000000,-4000000),x_axis_type="mercator", y_axis_type="mercator",tools=[hover, 'wheel_zoom','save'])
p.add_tile(CARTODBPOSITRON)
p.circle(x='longitude',
         y='latitude', 
         size='sizes',
         source=source,
         line_color="#FF0000", 
         fill_color="#FF0000",
         fill_alpha=0.05)
# output_notebook()
# file_html(p,CDN)
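
One idea for avoiding the trial and error, which I have not tried on the full plot, is to derive the limits from the data itself by taking the minimum and maximum easting and northing and adding a margin. The helper `padded_range` and its `pad_fraction` parameter below are hypothetical names, and the sample values are a few rows from the E and N columns above. It assumes there are no missing coordinates.

```python
def padded_range(values, pad_fraction=0.1):
    """Return (low, high) covering `values` plus a margin on each side."""
    low, high = min(values), max(values)
    # Fall back to a fixed margin when all points coincide (span of zero).
    margin = (high - low) * pad_fraction or 1.0
    return low - margin, high + margin


# A few eastings and northings from the summary table above.
eastings = [18987390.0, 18970700.0, 19217450.0, 19519870.0]
northings = [-5558768.0, -5761673.0, -5393506.0, -4584136.0]

x_range = padded_range(eastings)
y_range = padded_range(northings)
print(x_range, y_range)
```

The resulting tuples could then be passed as `x_range` and `y_range` to `figure`. Bear in mind this frames only the plotted towns rather than the whole country, so a larger `pad_fraction` may look better.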


In [28]:
# output_file("NZ_City_Letter_Analysis.html")
# save(p)
html = file_html(p, CDN, "NZ_City_Letter_Analysis")
from IPython.display import HTML
HTML(html)