# Table of contents

>[Elections cleaning pipeline](#scrollTo=R7TvXCuJNC1j)

>>[Setting the working directory](#scrollTo=8G8tKDGUNnFW)

>>[Loading the elections data](#scrollTo=sGemIXUoNs3N)

>>>[Selecting relevant data](#scrollTo=gxIk68apPXSD)

>>>>[Variables explanation](#scrollTo=gxIk68apPXSD)

>>[Matching locations](#scrollTo=S4KJSJ6-xSdd)

>>>[Loading the press dataset](#scrollTo=ZL44UMR9y1iT)

>>>[Filtering out problematic counties](#scrollTo=a-KoEJCvXVnI)

>>>[Removing Eire places](#scrollTo=5nIyCxLqTxgs)

>>>[Londonberry issue](#scrollTo=WTiRfqOXsLHt)

>>>[Removing "county" and "city" from names (Irish places)](#scrollTo=a9oxP1A89HmX)

>>>[Removing "districts of burghs/boroughs"](#scrollTo=IwCCzvQI1SFd)

>>>[Removing university constituencies](#scrollTo=TdSS-uDD5amZ)

>>>[Creating dictionaries to match constituency to county](#scrollTo=41s7tynCakmQ)

>>[Prepping unmatched places for geocoding](#scrollTo=tbA2KeADlfB0)

>>>[Try a fuzzy-match approach to manually inspect possible matches](#scrollTo=nr-6CW23vfhb)

>>>[Creating fuzzy match dictionaries](#scrollTo=bnw3yZHI8Skq)

>>[Fuzzy match with lower thresholds](#scrollTo=p8iUgS0v7BQA)

>>>>[Pattern 1](#scrollTo=-ZGRLY_E7wKk)

>>>[Patterns 2 and 3](#scrollTo=TVeZuBuz7z5G)

>>>[Filtering out constituencies referencing multiple places](#scrollTo=5YUSQTk9K3pB)

>>[Geocoding](#scrollTo=ge8saIOdatLD)

>>>[Running the geocoding](#scrollTo=dBBxZJm8cIWI)

>>>[Removing Eire places - part 2](#scrollTo=5z-sAlYp3BiE)

>>>[Matching coordinates to areas](#scrollTo=HxXGbKi-uzmI)

>[Final check](#scrollTo=FhyPfJYlbvvO)



# Elections cleaning pipeline

The original elections dataset, `CLEA.xlsx`, comes from the CLEA project by the University of Michigan.
Their datasets contain historical election data for most countries and are therefore split into two groups (alphabetically by country).
The file used here is the one with the second half of the countries, which includes `UK` for the United Kingdom.

## Setting the working directory

If you are running this notebook on Google Colab, you can use the code below to mount your Google Drive as the working directory.


```
from google.colab import drive
drive.mount('/content/drive')
```

The file structure for this and the other notebooks should be the following:
```
├── Main working directory
│   ├── pickles (folder)
│   │   ├── *elections_cleaned.pkl (generated by this file)
│   │   ├── *cleaned_press_directories.pkl (necessary for this file)
│   ├── input (folder)
│   ├── output (folder)
│   ├── index.html
│   ├── main.js
│   ├── style.css
```



## Loading the elections data

For the first run, the full dataset will be loaded (with over 640 thousand lines). On that first run, a [pickle](https://docs.python.org/3/library/pickle.html) will be created to store the data and avoid having to re-run the onerous loading process.

In [None]:
import pickle
import pandas as pd
try:
  with open("pickles/elections.pkl", 'rb') as f:
      elections = pickle.load(f)
except FileNotFoundError:
  # No cached pickle yet: load the full spreadsheet (slow)
  elections = pd.read_excel("input/CLEA.xlsx")
  # this restricts the dataset to the UK electoral data
  elections = elections[elections['ctr_n'] == 'UK']

  # this restricts the dataset to the timeframe of the Press Directories dataset
  elections = elections[elections['yr'] < 1923]
  elections = elections[elections['yr'] > 1845]

  elections = elections.reset_index(drop = True)


  # Save the elections dataset as a pickle file for future use
  # (the with-block closes the file automatically)
  with open("pickles/elections.pkl", 'wb') as f:
      pickle.dump(elections, f)
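
The load-or-build pattern used in the cell above can also be factored into a small helper. This is a minimal sketch and not part of the original notebook: `load_or_build` and its `build` argument are hypothetical names, and the helper assumes that only a missing file should count as a cache miss.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the pickled object at cache_path, building and caching it if absent.

    `build` is a zero-argument callable (hypothetical, for illustration) that
    performs the expensive load, e.g. reading and filtering the Excel file.
    """
    cache_path = Path(cache_path)
    try:
        with cache_path.open("rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Cache miss: run the expensive step once and store the result
        data = build()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with cache_path.open("wb") as f:
            pickle.dump(data, f)
        return data
```

Catching only `FileNotFoundError` means that a corrupted pickle raises an error instead of silently triggering a full reload.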

### Selecting relevant data
The full dataset spans over 30 columns and contains many different types of data, but only some of it is relevant to the process.
The extended dataset is kept in the pickle in case other columns become relevant later.

#### Variables explanation
Some variables (i.e. column names) are explained here; the full explanation of the complete dataset can be found in the CLEA documentation.

| Variable | Meaning                                                                                                                                   |
|----------|-------------------------------------------------------------------------------------------------------------------------------------------|
| `id`     | ID of that specific election, relevant because in some years there are multiple elections (with different IDs)                            |
| `yr`     | Year in which the election took place                                                                                                     |
| `cst_n`    | Name of the constituency                                                                                                                  |
| `sub`      | Country's subdivision to which the constituency belongs (here, mostly the constituent nation and/or the type of constituency, e.g. county vs borough) |
| `cst`      | Numerical ID of the constituency                                                                                                          |
| `mag`      | Magnitude, i.e. how many MPs the constituency elects                                                                                      |

In [None]:
# Each column is listed once: a repeated name (such as "can") would produce duplicate columns
elections_reduced = elections[["id", "yr", "cst_n", "sub", "cst", "mag", "pty_n", "pty", "can", "vot1", "pev1", "pv1", "pvs1", "cvs1"]]

In [None]:
elections_replaced = elections_reduced.copy().reset_index(drop = True)


## Matching locations
Finding a match for all electoral constituencies is a two-step process: the first step uses the Press Directories dataset to find the corresponding counties, and the second tries to match the remaining places using a geocoding service.

The geocoding part has its limitations, but it allows for a faster turnaround that can then be tweaked manually. Given that this project mostly concerned the Press Directories rather than the electoral dataset, I went along with the tweaking process until I had a generally satisfactory result, but in the future it could (and should) be improved.

The best way to obtain these matches would be to go through official data from the history of Boundary Commissions, but at the moment this is hard to do.

Finally, a good alternative might be matching constituency names with constituency boundaries, and then using the boundaries to find overlaps with historic counties. This is technically possible, but I could not find historic boundaries data with permissive licensing.

### Loading the press dataset

After cleaning the data with a dedicated script, the Press Directories dataset should be available as a pickle.

In [None]:
with open("pickles/cleaned_press_directories.pkl", 'rb') as f:
      press = pickle.load(f)

### Filtering out problematic counties

In a few cases (13), multiple districts in the Press dataset share the same name while belonging to different counties. Those districts cannot be clearly paired with a corresponding constituency, so they are filtered out to avoid issues. One way to match them back up would be to use other sources to see where the constituency was located (possibly cross-referencing electoral outcomes), thus settling the correspondence.

In [None]:
press_districts_counties = press[["county", "district", "map_county", "year"]]

# This groups the counties by district
district_counties_duplicate = press_districts_counties.groupby('district')['map_county'].agg(["unique"])

# This adds a column counting how many counties are there for each district
district_counties_duplicate['county_count'] = district_counties_duplicate['unique'].apply(len)

# Filter districts with county_count higher than 1
filtered_districts = district_counties_duplicate[district_counties_duplicate['county_count'] > 1].reset_index()

# Converting the districts column into a list, I can then filter out all rows
# in the election dataset where the constituency name is the same as that list
elections_replaced = elections_replaced[~elections_replaced["cst_n"].isin(list(filtered_districts["district"]))]

In [None]:
del filtered_districts, district_counties_duplicate, f
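
The grouping logic above can be illustrated without pandas. The following is a minimal sketch on invented sample rows; `ambiguous_districts` is a hypothetical helper name, and the tuples stand in for the `district` and `map_county` columns.

```python
from collections import defaultdict


def ambiguous_districts(pairs):
    """Return the districts that appear with more than one county.

    `pairs` is an iterable of (district, map_county) tuples, a plain-Python
    stand-in for the two dataframe columns.
    """
    counties_by_district = defaultdict(set)
    for district, county in pairs:
        counties_by_district[district].add(county)
    return sorted(d for d, counties in counties_by_district.items() if len(counties) > 1)


# Invented sample rows, for illustration only
sample = [
    ("newport", "monmouthshire"),
    ("newport", "shropshire"),
    ("leeds", "yorkshire"),
    ("leeds", "yorkshire"),
]
ambiguous = ambiguous_districts(sample)
```

Here only `newport` is returned, because `leeds` always maps to the same county.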

### Removing Eire places
In the case of the elections dataset, the `sub` variable tells me if a place is in Ireland, so I do not need to design a specific matching pattern as I did for the press dataset. However, here Eire and Northern Ireland counties are all counted as Ireland before 1921, so I still need to distinguish between the two.

In [None]:
irish_cst_n = sorted(list(elections_replaced[elections_replaced["sub"]== "ireland"]["cst_n"].unique()))
# Printing out the supposed Irish constituencies is a necessary step to check if
# there are any that are not actually Irish.

From this list, I can use the same list of Eire counties I used for the press dataset to take out the constituencies that are certainly in Eire.
This won't exclude all Eire places, because some of them won't have an Eire county in their name.
This is not particularly problematic, because they will be removed after the geocoding, when they only match Eire.

In [None]:
# List counties in the current Republic of Ireland
eire_counties = [
    "carlow",
    "cavan",
    "clare",
    "cork",
    "donegal",
    "dublin",
    "galway",
    "kerry",
    "kildare",
    "kilkenny",
    "laois",
    "leitrim",
    "limerick",
    "longford",
    "louth",
    "mayo",
    "meath",
    "monaghan",
    "offaly",
    "roscommon",
    "sligo",
    "tipperary",
    "waterford",
    "westmeath",
    "wexford",
    "wicklow",
    "king's",
    "queen's"
]


remove_irish = []
for place in irish_cst_n:
  for county in eire_counties:
    if county in place or county == place:
      remove_irish.append(place)


Finally, I can remove all these constituencies and then delete the variables used in the process.

In [None]:
elections_replaced = elections_replaced[~elections_replaced["cst_n"].isin(remove_irish)]
del irish_cst_n, remove_irish, place, county, eire_counties
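
The substring test used to build `remove_irish` can be written as a single helper. This is a sketch on a shortened county list and invented constituency names; `mentions_any` is a hypothetical name. Unlike the nested loop, it lists each name at most once, which makes no difference for the `isin` filter but keeps the list easier to read.

```python
def mentions_any(name, keywords):
    """Return True if any keyword occurs as a substring of name."""
    return any(keyword in name for keyword in keywords)


# A shortened county list and invented constituency names, for illustration only
counties = ["cork", "dublin", "king's"]
names = ["cork city", "dublin harbour", "belfast east", "king's county"]
to_remove = [name for name in names if mentions_any(name, counties)]
```

Because this is a plain substring test, a county name hidden inside a longer word would also match, so printing and checking the resulting list remains a sensible precaution.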

### Londonberry issue
Printing the Irish counties showed an issue with Londonderry, where the city appears as London**B**erry. The code below replaces every cell containing the misspelling with `londonderry`.

I also define the new column, `to_match` where I store values that can then be used for geocoding or other types of matching, thus leaving `cst_n` intact.

In [None]:
# Custom function to replace cells with "londonberry" with "londonderry"
def replace_londonberry(cell_value):
    if "londonberry" in cell_value.lower():
        return "londonderry"
    return cell_value
elections_replaced["to_match"] = elections_replaced["cst_n"].apply(replace_londonberry)

filtered_df = elections_replaced[elections_replaced["cst_n"].str.contains("londonberry", case=False)]
print(filtered_df["to_match"].unique())

['londonderry']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  elections_replaced["to_match"] = elections_replaced["cst_n"].apply(replace_londonberry)
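
The message above is pandas' `SettingWithCopyWarning`: `elections_replaced` was produced by boolean filtering, so pandas cannot tell whether the new column is meant for the filtered dataframe or for the one it was derived from. One common way to avoid it, sketched here on an invented toy dataframe, is to take an explicit `.copy()` right after filtering. This is a suggestion, not what the notebook does.

```python
import pandas as pd

# Invented toy data, for illustration only
df = pd.DataFrame({
    "cst_n": ["londonberry city", "armagh city", "cork city"],
    "sub": ["ireland", "ireland", "ireland"],
})

# Filtering returns an object derived from df; copying it makes it an
# independent dataframe, so adding a column to it is unambiguous
filtered = df[df["cst_n"] != "cork city"].copy()

# In this sketch only the misspelled part of the name is replaced
filtered["to_match"] = filtered["cst_n"].str.replace(
    "londonberry", "londonderry", regex=False
)
```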


### Removing "county" and "city" from names (Irish places)
Similarly to what I did with the Press dataset, this applies to Irish counties, which are often indicated as "[county_name] county".
The use of "city" after the name of the city is also quite common for Irish cities.

The first step is finding all possible Irish constituencies by matching cells containing " county" or " city" (the leading space is there to avoid possible matches with places containing the letters as part of a longer word).

In [None]:
county_city = list(elections_replaced[elections_replaced["to_match"]
                                      .str.contains(" county| city", case=False, regex = True)]
                                       ["to_match"].unique())

for place in county_city:
  print(place)

antrim county
armagh city
armagh county
down county
fermanagh county
tyrone county
cheshire, city of chester
dublinn county south


Given that there are only a few matches (mostly because many of them have been removed by filtering out Eire counties), I created a dictionary with the replacements.

In [None]:
city_county_replacements = {
    "antrim county" : "antrim",
    "armagh city" : "armagh",
    "armagh county" : "armagh",
    "down county" : "down",
    "fermanagh county" : "fermanagh",
    "tyrone county" : "tyrone",
    "dublinn county south" : "dublin",
    "cheshire, city of chester" : "cheshire"
    }
elections_replaced["to_match"] = elections_replaced["to_match"].replace(city_county_replacements)



In [None]:
del county_city, city_county_replacements

### Removing "districts of burghs/boroughs"
In the UK system, the main city of a county could form a city-constituency outside of the county itself, normally called "[city_name] district of boroughs" if in England or Wales and "[city_name] district of burghs" if in Scotland.

As usual, I try and match these, then print them out to make sure I am not over-matching, then replace them with the name of the city.

In [None]:
burghs = elections_replaced[elections_replaced["cst_n"].str.contains(" burgh| borough", case=False, regex = True)][["cst_n", "sub"]]
print(burghs["cst_n"].unique())

['ayr district of burghs' 'beaumaris district of boroughs'
 'caernarvon district of boroughs' 'cardiff district of boroughs'
 'cardigan district of boroughs' 'carmarthen district of boroughs'
 'denbigh district of boroughs' 'dumfries district of burghs'
 'elgin district of burghs' 'falkirk district of burghs'
 'flint district of boroughs' 'haddington district of burghs'
 'haverfordwest district of borough' 'iverness district of burghs'
 'kilmarnock district of burghs' 'kirkcaldy district of burghs'
 'leith district of burghs' 'monmouth district of boroughs'
 'montgomery district of boroughs' 'montrose district of burghs'
 'pembroke district of boroughs' 'radnor district of boroughs'
 'st. andrews district of burghs' 'stirling district of burghs'
 'swansea district of boroughs' 'wick district of burghs'
 'wigton district of burghs' 'hawick district of burghs'
 'st andrews district of burghs' 'dumbarton district of burghs'
 'dunfermline district of burghs' 'krikcaldy district of burghs']
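
Before applying the replacement to the dataframe, the pattern can be tried on a few plain strings taken from the output above. This is a minimal sketch using the standard `re` module; `strip_district_suffix` is a hypothetical helper, and anchoring the pattern to the end of the name is an assumption made here to reduce the risk of over-matching.

```python
import re

# Matches the suffix with or without the final "s", ignoring case,
# and only when it ends the name
DISTRICT_SUFFIX = re.compile(r"\s*district of (?:burgh|borough)s?\s*$", re.IGNORECASE)


def strip_district_suffix(name):
    """Remove a trailing 'district of burghs/boroughs' from a constituency name."""
    return DISTRICT_SUFFIX.sub("", name).strip()


cleaned = [strip_district_suffix(n) for n in [
    "ayr district of burghs",
    "haverfordwest district of borough",
    "st. andrews district of burghs",
]]
```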

In [None]:
elections_replaced["to_match"] = elections_replaced["to_match"].str.replace(
    "district of burghs|district of boroughs|district of burgh|district of borough", '', case=False, regex = True
    ).str.strip()
del burghs



### Removing university constituencies
[University constituencies](https://en.wikipedia.org/wiki/University_constituency) are a very special type of constituency, since they do not correspond to a geographical area, but rather to a community, i.e. the alumni and sometimes students of a specific university.

This makes them somewhat irrelevant to the tool since, even back in the 19th century, it would be disingenuous to think that all (or even most) graduates of a specific university would still reside in the area. Moreover, in 1918 most of these were grouped at the national level, creating constituencies such as "[combined English universities](https://en.wikipedia.org/wiki/Combined_English_Universities_(UK_Parliament_constituency))".

Given all of this, I decided to remove these constituencies altogether: as usual, I try and match these, then print them out to make sure I am not over-matching, then remove them.

In [None]:
university = elections_replaced[elections_replaced["to_match"].str.contains("univers")]
university_sub = elections_replaced[elections_replaced["sub"].str.contains("univers")]
# Running them both is necessary because some cst_n are truncated; moreover,
# while this is not the case in the dataset I first used, it's possible that
# a constituency would not have "university" in the name but still be listed
# as a university constituency (and vice versa)
print(university["to_match"].unique())
print(university_sub["to_match"].unique())

['cambridge university' 'oxford university'
 'aberdeen and glasgow universities' 'edinburgh and st. andrews univers'
 'london university' 'edinbrugh and st andrews universi'
 'combined english universities' 'combined scottish universities'
 'national university of ireland' "queen's university of belfast"
 'university of wales']
['cambridge university' 'oxford university'
 'aberdeen and glasgow universities' 'edinburgh and st. andrews univers'
 'london university' 'edinbrugh and st andrews universi'
 'combined english universities' 'combined scottish universities'
 'university of wales']


In [None]:
elections_replaced = elections_replaced[~elections_replaced["to_match"].str.contains("univers")]
elections_replaced = elections_replaced[~elections_replaced["sub"].str.contains("univers")]

### Creating dictionaries to match constituency to county
Using the matches between district and county and district and map_county in the press dataset, I can try and find matches for the electoral constituencies.

The first step is creating the dictionaries.

In [None]:
#This first dictionary matches press districts to map_counties
district_map_county = dict(zip(press['district'], press['map_county']))
#This second dictionary matches counties to map_counties, since some might use a different naming convention
county_map_county = dict(zip(press['county'], press['map_county']))
# Finally, this is a list of map_counties. If the cst_n in the elections dataset
# matches one of these, it is kept verbatim
press_map_counties = list(press["map_county"].unique())
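
One property of `dict(zip(...))` is worth keeping in mind here: when a key appears more than once, the last value silently wins. The sketch below shows this on invented rows; it is one reason why districts listed under several counties need to be handled before relying on these dictionaries.

```python
# Invented rows: the same district listed under two different counties
districts = ["newport", "leeds", "newport"]
map_counties = ["monmouthshire", "yorkshire", "shropshire"]

# Later pairs overwrite earlier ones that share the same key
lookup = dict(zip(districts, map_counties))
```

After this runs, `lookup["newport"]` is `"shropshire"` and the first pairing is lost without any warning.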

In [None]:
def map_county_replace():
  # This counter tracks how many constituencies were successfully matched
  success = 0
  for id_number in elections_replaced["id"].unique():
    for index, row in elections_replaced[elections_replaced["id"] == id_number].iterrows():
      constituency = row["to_match"]
      if constituency in press_map_counties:
        # The constituency is already named after a map county
        elections_replaced.at[index, "map_county"] = constituency
        success += 1
      elif constituency in district_map_county:
        elections_replaced.at[index, "map_county"] = district_map_county[constituency]
        success += 1
      elif constituency in county_map_county:
        elections_replaced.at[index, "map_county"] = county_map_county[constituency]
        success += 1
      else:
        elections_replaced.at[index, "map_county"] = "not_found"
  print(f"{success} successful replacements")

map_county_replace()


7208 successful replacements
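
The lookup order used by `map_county_replace` can be isolated in a pure function, which makes it easier to test on a handful of names. This is a sketch with invented lookup tables; `find_map_county` and its parameters are hypothetical names, not part of the notebook.

```python
def find_map_county(name, map_counties, district_to_county, county_to_county):
    """Return the map county for a constituency name, or 'not_found'.

    The lookup order mirrors the cell above: exact map county first, then
    press district, then press county.
    """
    if name in map_counties:
        return name
    if name in district_to_county:
        return district_to_county[name]
    if name in county_to_county:
        return county_to_county[name]
    return "not_found"


# Invented lookup tables, for illustration only
map_counties = {"yorkshire", "glamorgan"}
district_to_county = {"leeds": "yorkshire"}
county_to_county = {"glamorganshire": "glamorgan"}
results = [
    find_map_county(n, map_counties, district_to_county, county_to_county)
    for n in ["yorkshire", "leeds", "glamorganshire", "atlantis"]
]
```

A function of this shape could in principle be passed to `Series.apply` on the `to_match` column instead of looping over rows, although that variant has not been tested against the real data.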


## Prepping unmatched places for geocoding
At this point, it would already be possible to use geocoding to find the locations. However, many rows still contain typos or places that would be easy for a human to match.
To find them, I used an approach in a few steps:

### Try a fuzzy-match approach to manually inspect possible matches
In my case, I had around 1000 unmatched items, which is a relatively short list to skim through, so putting them in a table with their possible fuzzy matches and reading through them takes only a few minutes. The same approach might be trickier for longer lists.

In [None]:
!pip install thefuzz
from thefuzz import fuzz
from thefuzz import process



In [None]:
def fuzzy_match(dataframe, press_map_counties, threshold, shire = False):
  """
  Fuzzy matches the unmatched constituencies of a dataframe against a list.

  Args:
    dataframe: The dataframe with the "to_match", "sub" and "map_county" columns.
    press_map_counties: The list of elements to match against.
    threshold: The minimum score (0-100) for a match to be kept.
    shire: If True, also try matching each place with "shire" appended.

  Returns:
    A dictionary mapping each unmatched place to its best match ("place")
    and to its subdivision in the elections dataset ("sub").
  """

  unmatched = dataframe[dataframe["map_county"] == "not_found"]
  unmatched_places = unmatched[["to_match", "sub"]].drop_duplicates()
  matches = {}
  for index, row in unmatched_places.iterrows():
    unmatched_item = row["to_match"]
    best_match = process.extractOne(unmatched_item, press_map_counties)
    if best_match[1] >= threshold:
      # Adding the sub is useful to understand if the matched place is actually
      # in that county
      matches[unmatched_item] = {"place" : best_match[0], "sub" : row["sub"]}
    # In some cases, the county is the name of the city plus "shire", so
    # appending it is a relatively easy way to spot possible matches
    if shire:
      best_match = process.extractOne(unmatched_item + "shire", press_map_counties)
      if best_match[1] >= threshold:
        matches[unmatched_item] = {"place" : best_match[0], "sub" : row["sub"]}

  return matches
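
To get a feel for what a best-match lookup with a threshold does, the idea can be reproduced with the standard library alone. The sketch below swaps in `difflib.SequenceMatcher` for `thefuzz`: its scores are a stand-in and will not coincide with the ones `process.extractOne` returns. `best_match` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (choice, score) for the choice most similar to query.

    The score is a 0-100 integer derived from difflib's ratio; it is a
    stand-in for, not a reproduction of, thefuzz's scoring.
    """
    scored = [
        (choice, round(100 * SequenceMatcher(None, query, choice).ratio()))
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])


counties = ["berwickshire", "buteshire", "roxburghshire"]

# A typo still finds the intended county with a high score
typo_match = best_match("berwickshre", counties)

# Appending "shire" turns a town-style name into an exact county match
shire_match = best_match("bute" + "shire", counties)
```

The second call shows why the `shire` option is useful: a name that is only a partial match on its own becomes an exact match once the suffix is appended.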

### Creating fuzzy match dictionaries
These dictionaries provide possible matches; they should be inspected before use, especially if any of the underlying datasets have changed.

In [None]:
# This first run uses the optional shire argument to add "shire" at the end
# of the places to match, combined with a relatively high threshold to first find
# most likely replacements
possible_matches = fuzzy_match(elections_replaced, press_map_counties, 91, shire = True)
for key, value in possible_matches.items():
    print(key, ' : ', value["place"], "==", value["sub"])


print(len(possible_matches))

argyll  :  argyllshire == scotland-counties
berwickshre  :  berwickshire == scotland - counties
bute  :  buteshire == scotland-counties
caernarvon  :  caernarfonshire == wales and monmouthshire-boroughs
caernarvonshire  :  caernarfonshire == wales and monmouthshire-counties
cambridge  :  cambridgeshire == england-provincial boroughs
denbigshire  :  denbighshire == wales and monmouthshire - counties
flint  :  flintshire == wales and monmouthshire-boroughs
iverness  :  inverness-shire == scotland-burghs
iverness-shire  :  inverness-shire == scotland-counties
montgomery  :  montgomeryshire == wales and monmouthshire-boroughs
radnor  :  radnorshire == wales and monmouthshire - boroughs
rowburghshire  :  roxburghshire == scotland - counties
lincolnshire, mid  :  lincolnshire == england - counties
herefordshire, ross  :  herefordshire == england-counties
lanarkshire, mid  :  lanarkshire == scotland-counties
northamptonshire, mid  :  northamptonshire == england-counties
staffordshire, leek  :

Once this dictionary has been verified, I can turn it into a replacement dictionary and carry out the replacement with pandas. Then I can either create a new one casting a wider net (i.e. a lower fuzzy match threshold) or start the geocoding.

In [None]:
def run_replacement_dictionary(possible_matches):
  for index, row in elections_replaced.iterrows():
    constituency = row["to_match"]
    if constituency in possible_matches.keys():
      elections_replaced.at[index, "map_county"] = possible_matches[constituency]["place"]

run_replacement_dictionary(possible_matches)

## Fuzzy match with lower thresholds

In [None]:
# I remove the optional shire argument and run the code again with a lower
# threshold for the fuzzy match
possible_matches = fuzzy_match(elections_replaced, press_map_counties, 90)
for key, value in possible_matches.items():
    print(key, ' : ', value["place"], "==", value["sub"])


print(len(possible_matches))

In my case, running the code with a lower threshold clearly shows three main matching patterns:

1. Constituencies made up of multiple places concatenated with `and` are matched as one of the counties listed
2. Places identified as `[county], [city]` are matched with `[county]`
3. Constituencies that contain geographical indications such as `western`, `mid` or `west` are matched with the county coming before the indication, in the format `[county], [indication]`

While patterns 2 and 3 are generally beneficial, pattern 1 is intrinsically destructive, since it will often erase one or more counties in favour of another one.
Given this, I half-manually checked those matches by printing out the matches with " and " in the key and then selecting only those that fitted pattern 1 (for example, I did not exclude cases such as "cornwall, penryn and falmouth").

#### Pattern 1

In [None]:
for key in possible_matches.keys():
  if " and " in key:
    print(f"'{key}',")

'clackmannanshire and kinross-shir',
'elginshire and nairnshire',
'orkney and shetland',
'peeblesshire and selkirkshire',
'peeplesshire and selkirkshire',
'Inverness-shire and Ross and Cromarty, Western Isles',
'Stirlingshire and Clackmannanshire, West Stirlingshire',
'aberdeenshire and kincardineshire',
'ayrshire and bute, bute and north',
'ayrshire and bute, kilmarnock',
'ayrshire and bute, south ayrshire',
'berwickshire and haddingtonshire',
'breconshire and radnorshire',
'caithness and sutherland',
'cheshire, stalybridge and hyde',
'cornwall, penryn and falmouth',
'cumberland, penrith and cockermou',
'glamorganshire, llandaff and barr',
'gloucestershire, forest and dean',
'hampshire, new forest and christc',
'lancashire, heywood and radcliffe',
'lancashire, middleton and prestwi',
'lincolnshire, rutland and stamfor',
'middlesex, brentford and chiswick',
'midlothian and peeblesshire, nort',
'midlothian and peeblesshire, peeb',
'perthshire and kinross-shire, kin',
'perthshire and kin

In [None]:
remove_from_matches = [
  'clackmannanshire and kinross-shir',
  'elginshire and nairnshire',
  'orkney and shetland',
  'peeblesshire and selkirkshire',
  'peeplesshire and selkirkshire',
  'Inverness-shire and Ross and Cromarty, Western Isles',
  'Stirlingshire and Clackmannanshire, West Stirlingshire',
  'aberdeenshire and kincardineshire',
  'ayrshire and bute, bute and north',
  'ayrshire and bute, kilmarnock',
  'ayrshire and bute, south ayrshire',
  'berwickshire and haddingtonshire',
  'breconshire and radnorshire',
  'caithness and sutherland',
  'midlothian and peeblesshire, nort',
  'midlothian and peeblesshire, peeb',
  'perthshire and kinross-shire, kin',
  'perthshire and kinross-shire, per',
  'roxburghshire and selkirkshire',
  'stirlingshire and clackmannanshir',
  'fermanagh and tyrone'
]

def remove_keys(dictionary, keys_to_remove):
  return {key: value for key, value in dictionary.items() if key not in keys_to_remove}

# Now I can remove all matches that represent multiple constituencies
possible_matches = remove_keys(possible_matches, remove_from_matches)
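
The manual selection above could be approximated with a simple heuristic: a name fits pattern 1 when " and " appears before the first comma. This is an assumption introduced here, not the method used in the notebook, and `joins_multiple_counties` is a hypothetical name. It agrees with the manual choice for the names visible in the printed output, but a name without a comma that joins two towns would be flagged as well, so the result still needs checking by eye.

```python
def joins_multiple_counties(name):
    """Heuristic: True when ' and ' appears before the first comma.

    'ayrshire and bute, kilmarnock' joins two counties (pattern 1), while
    'cornwall, penryn and falmouth' names one county followed by towns.
    """
    head = name.split(",")[0]
    return " and " in head


candidates = [
    "ayrshire and bute, kilmarnock",
    "orkney and shetland",
    "cornwall, penryn and falmouth",
    "cheshire, stalybridge and hyde",
]
flagged = [name for name in candidates if joins_multiple_counties(name)]
```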


### Patterns 2 and 3
> 2. Places identified as `[county], [city]` are matched with `[county]`
> 3. Constituencies that contain geographical indications such as `western`, `mid` or `west` are matched with the county coming before the indication, in the format `[county], [indication]`

To check this manually, I do the following:
1. search for a comma in the place name and then split off the first part
2. check if the first part is a county

By doing this, I create two lists that are easier to inspect manually; moreover, the first list contains most likely matches, while the second one merits more scrutiny.
In the end, it looks like all matches are good, so I replace them all.

In [None]:
for key, value in possible_matches.items():
  if key.split(",")[0] in press_map_counties:
    #print(key, ' : ', value["place"], "==", value["sub"])
    continue
  else:
    print(key, ' : ', value["place"], "==", value["sub"])


rutlandshire  :  rutland == england-counties
yorkshire (west riding), northern  :  yorkshire == england - counties
yorkshire (west riding), southern  :  yorkshire == england - counties
yorkshire (west riding), eastern  :  yorkshire == england - counties
antrim east  :  antrim == northern ireland
antrim north  :  antrim == northern ireland
antrim south  :  antrim == northern ireland
antrim mid  :  antrim == northern ireland
armagh mid  :  armagh == northern ireland
armagh north  :  armagh == northern ireland
armagh south  :  armagh == northern ireland
down east  :  down == northern ireland
down north  :  down == northern ireland
down south  :  down == northern ireland
down west  :  down == northern ireland
fermanagh north  :  fermanagh == northern ireland
fermanagh south  :  fermanagh == northern ireland
glamorganshire, eastern  :  glamorgan == wales and monmouthshire-counties
glamorganshire, gower  :  glamorgan == wales and monmouthshire-counties
glamorganshire, mid  :  glamorgan == wa

To carry out the replacement, I create a dictionary based on the `possible_matches` one, which is built like this:

```
{place_tried_to_match: {"place": resulting_match, "sub": elections_dataset_sub}}
```



In [None]:
replacements = {}
for key, value in possible_matches.items():
  replacements[key] = value["place"]

elections_replaced["to_match"] = elections_replaced["to_match"].replace(replacements)
map_county_replace()

13398 successful replacements


### Filtering out constituencies referencing multiple places
Finally, I tried to run the fuzzy matching again but I got mostly bad results, so at this point I moved to geocoding. Before that, though, I checked once again for constituencies with an " and " in their name, because multiple names would most likely yield bad geocoding results.

In [None]:
geocoding_dataframe = elections_replaced[elections_replaced["map_county"] == "not_found"]

In [None]:
for place in geocoding_dataframe["to_match"].unique():
  if " and " in place:
    print(f"'{place}',")

'clackmannanshire and kinross-shir',
'elginshire and nairnshire',
'orkney and shetland',
'penryn and falmouth',
'ross and cromarty',
'tynemouth and north shields',
'weymouth and melcombe regis',
'peeblesshire and selkirkshire',
'battersea and clapham i',
'battersea and clapham ii',
'glasgow, backfriars and hutcheson',
'peeplesshire and selkirkshire',
'pembroke and haverfordwest distri',
'tower hamlets,bow and bromley',
'warwick and leamington',
'Inverness-shire and Ross and Cromarty, Western Isles',
'Stirlingshire and Clackmannanshire, West Stirlingshire',
'aberdeenshire and kincardineshire',
'ayrshire and bute, bute and north',
'ayrshire and bute, kilmarnock',
'ayrshire and bute, south ayrshire',
'batley and morley',
'berwickshire and haddingtonshire',
'breconshire and radnorshire',
'caithness and sutherland',
'iverness-shire and ross and croma',
'midlothian and peeblesshire, nort',
'midlothian and peeblesshire, peeb',
'moray and mairnshire',
'nelson and colne',
'perthshire and kinros

Using the results of the block above, I manually sorted the " and " places into three lists: places to remove, places to geocode under a corrected name, and places to match without geocoding.

In [None]:
#These should be removed from geocoding because they correspond to constituencies
# made up of multiple counties. Since many of them were created in 1918, it might
# make sense to remove that and the next years
places_to_remove = [
  'clackmannanshire and kinross-shir',
  'elginshire and nairnshire',
  'orkney and shetland',
  'ross and cromarty',
  'peeblesshire and selkirkshire',
  'peeplesshire and selkirkshire',
  'Inverness-shire and Ross and Cromarty, Western Isles',
  'Stirlingshire and Clackmannanshire, West Stirlingshire',
  'aberdeenshire and kincardineshire',
  'ayrshire and bute, bute and north',
  'ayrshire and bute, kilmarnock',
  'ayrshire and bute, south ayrshire',
  'berwickshire and haddingtonshire',
  'breconshire and radnorshire',
  'caithness and sutherland',
  'iverness-shire and ross and croma',
  'midlothian and peeblesshire, nort',
  'midlothian and peeblesshire, peeb',
  'moray and mairnshire',
  'perthshire and kinross-shire, kin',
  'perthshire and kinross-shire, per',
  'roxburghshire and selkirkshire',
  'stirlingshire and clackmannanshir',
  'stirlingshrie and clackmannanshir',
  'fermanagh and tyrone'
]

# These are places that I manually matched to cities, so the geocoding should be
# run, but it should look for the cities
geocoding_replacements = {
  'battersea and clapham i' : "london",
  'battersea and clapham ii' : "london",
  'tower hamlets,bow and bromley': "london",
  'poplar, bow and bromley' : "london",
  'wandsworthm, balham and tooting' : "london",
  'stepney, whitechapel and st georg' : "london",
  'glasgow, backfriars and hutcheson' : "glasgow"
}

# These are manual matches to counties
manual_replacements = {
  'penryn and falmouth' : "cornwall",
  'tynemouth and north shields' : "northumberland",
  'weymouth and melcombe regis' : "dorset",
  'pembroke and haverfordwest distri' : "pembrokeshire",
  'warwick and leamington' : "warwickshire",
  'batley and morley' : "yorkshire",
  'nelson and colne' : "lancashire",
  'stirling and falkirk district of' : "stirlingshire"
}


After applying the replacements above, I rerun the geocoding list generator to create an updated list.




In [None]:
elections_replaced["to_match"] = elections_replaced["to_match"].replace(geocoding_replacements)
elections_replaced["to_match"] = elections_replaced["to_match"].replace(manual_replacements)
map_county_replace()

13610 successful replacements


In [None]:
geocoding_dataframe = elections_replaced[elections_replaced["map_county"] == "not_found"]
places_to_geocode = []

# The empty list places_to_geocode will be populated only with the places
# that are still not_found and that are not among those with multiple counties
# in the name
all_places_to_geocode = list(geocoding_dataframe["to_match"].unique())
for place in all_places_to_geocode:
  if place not in places_to_remove:
    places_to_geocode.append(place)

Before geocoding, I run a last fuzzy match round comparing with map counties and districts. This is done because the various cleaning steps I went through, in particular the manual and geocoding replacements, might have generated new matchable strings.
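
As a rough illustration of the two-threshold idea, here is a minimal sketch that uses `difflib` from the standard library as a stand-in for `process.extractOne`. The scores it produces are not identical to the fuzzy-matching library's, and the helper names (`best_match`, `two_tier_match`) and the toy data are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (choice, score) for the most similar choice, with score in 0-100."""
    scored = [(c, 100 * SequenceMatcher(None, query, c).ratio()) for c in choices]
    return max(scored, key=lambda pair: pair[1])


def two_tier_match(place, counties, district_to_county,
                   county_cutoff=91, district_cutoff=89):
    """Try the county names first, then fall back to district names."""
    county, score = best_match(place, counties)
    if score > county_cutoff:
        return county
    district, score = best_match(place, list(district_to_county))
    if score > district_cutoff:
        # A district match is translated into the county it belongs to
        return district_to_county[district]
    return None


# Hypothetical toy data, not taken from the press dataset
counties = ["berkshire", "wiltshire", "lancashire"]
district_to_county = {"rochdale": "lancashire", "swindon": "wiltshire"}

print(two_tier_match("berkshire", counties, district_to_county))
print(two_tier_match("rochdalle", counties, district_to_county))
print(two_tier_match("orkney and shetland", counties, district_to_county))
```

With strict cutoffs, a multi-county name such as `orkney and shetland` scores too low against any single county and is left unmatched, which is the behaviour the high threshold is meant to guarantee.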

In [None]:
a = 0

last_fuzzy_match = {}

match_list = district_map_county.keys()
for place in sorted(places_to_geocode):
  best_match = process.extractOne(place, press_map_counties)
  # This threshold is higher to avoid matching constituencies composed of
  # multiple counties: above 90, only a near-exact match on the whole string
  # passes, rather than a match on one of its words
  if best_match[1] > 91:
    print(place, " ==> ", best_match[0])
    last_fuzzy_match[place] = best_match[0]
  else:

    best_match = process.extractOne(place, match_list)
    if best_match[1] > 89:
      print(place, " ==> ", best_match[0], "==>", district_map_county[best_match[0]])
      last_fuzzy_match[place] = district_map_county[best_match[0]]

    a +=1
# The places for which the fuzzy match is wrong are these

wrong_matches = {
    "bershire, wokingham"  : "berkshire",
    "finsbury" : "london",
    "finsbury , east"  : "london",
    "finsbury, central": "london",
    "finsbury, holborn"  : "london",
    "gatehead" : "ayrshire",
    "middlsex, spelthorne" : "middlesex",
    "newcastle under lyme" : "staffordshire",
    "newcastle-under-lyme" : "staffordshire",
    "rossendale" : "lancashire",
    "westbury" : "wiltshire"
}

# Replace those matches in the original dataset

elections_replaced["to_match"] = elections_replaced["to_match"].replace(wrong_matches)

# Some places are Eire ones that escaped the earlier filtering, so I remove them
escaped_irish = [
    "dublin (pembroke)",
    "ennis",
    "new ross"
]


# Remove both lists of elements from the dictionary of fuzzy match replacements
remove_keys(last_fuzzy_match, escaped_irish)
remove_keys(last_fuzzy_match, wrong_matches.keys())


## Geocoding
Finally, I can move on to cleaning up the dataframe and generating the list of places to geocode, once again excluding the multiple-county constituencies.

In [None]:
# Use the fuzzy match replacements
elections_replaced["to_match"] = elections_replaced["to_match"].replace(last_fuzzy_match)

# Check for any new possible match
map_county_replace()

# Generating the queue while taking care of avoiding any multiple county constituency

def generate_geo_queue():
  geocoding_queue = []
  not_found = list(elections_replaced[elections_replaced["map_county"] == "not_found"]["to_match"].unique())
  for place in not_found:
    if place not in places_to_remove:
      geocoding_queue.append(place)
  return geocoding_queue
geocoding_queue = generate_geo_queue()

16653 successful replacements


### Running the geocoding
This process uses the Geoapify batch geocoding API to find coordinates for each location provided. The request biases results towards the UK and Ireland (`bias=countrycode:gb,ie`), and the code creates a dataframe containing:
1. the original location text
2. the query transformed by geoapify and used to find the coordinates
3. if found, the coordinates
4. the level of confidence
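
To make the shape of each row concrete, here is a minimal sketch of how one result element could be flattened. It assumes the elements carry the keys that the notebook's code reads (`query.text`, `lat`, `lon`, `rank.confidence`, `country`). The function name `parse_result` and the sample values are hypothetical.

```python
def parse_result(element):
    """Flatten one geocoding result into the fields kept by the pipeline."""
    lat, lon = element.get("lat"), element.get("lon")
    has_coords = lat is not None and lon is not None
    return {
        "query": element.get("query", {}).get("text", ""),
        # Empty string when the place was not found
        "coordinates": f"{lat}, {lon}" if has_coords else "",
        # Zero confidence when no ranking information is returned
        "confidence": float(element.get("rank", {}).get("confidence", 0)),
        "country": element.get("country"),
    }


# Hypothetical elements: one successful lookup and one failed lookup
found = {"query": {"text": "thirsk"}, "lat": 54.23, "lon": -1.34,
         "rank": {"confidence": 0.9}, "country": "United Kingdom"}
missing = {"query": {"text": "thrisk"}}

print(parse_result(found))
print(parse_result(missing))
```

Using `dict.get` with defaults instead of nested `try`/`except` blocks keeps the missing-field behaviour explicit: no coordinates, zero confidence and no country.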

In [None]:
import requests
import time
import pandas as pd


timeout = 50
apiKey = "424c140b40554a7eb9998e0a7fd53d1c"
maxAttempt = 20
result = ""

def getLocations(places):
    url = "https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=" + apiKey
    response = requests.post(url, json = places)
    result = response.json()

    # The API returns the status code 202 to indicate that the job was accepted and pending
    status = response.status_code
    if (status != 202):
        print('Failed to create a job. Check if the input data is correct.')
        return
    jobId = result['id']
    getResultsUrl = url + '&id=' + jobId

    time.sleep(timeout)
    result = getLocationJobs(getResultsUrl, 0)
    if (result):
        print('You can also get results by the URL - ' + getResultsUrl)
        return result
    else:
        print('You exceeded the maximal number of attempts. Try to get results later. You can do this in a browser by the URL - ' + getResultsUrl)

def getLocationJobs(url, attemptCount):
    response = requests.get(url)
    result = response.json()
    status = response.status_code
    if (status == 200):
        print('The job is succeeded. Here are the results:')
        return result
    elif (attemptCount >= maxAttempt):
        return
    elif (status == 202):
        print('The job is pending...')
        time.sleep(timeout)
        return getLocationJobs(url, attemptCount + 1)


def find_places(data, batch_size=50):
  # Geocode the places in batches and collect one row per result
  locations_dict_list = []
  for start in range(0, len(data), batch_size):
      coordinates = getLocations(data[start:start + batch_size])
      if not coordinates:
          # getLocations returns None when the job fails or times out
          print(f"No results for the batch starting at index {start}")
          continue

      for element in coordinates:
          locations_dictionary = {}

          locations_dictionary["query"] = element["query"]["text"]
          try:
              locations_dictionary["coordinates"] = f"{element['lat']}, {element['lon']}"
          except KeyError:
              locations_dictionary["coordinates"] = ""
          try:
              locations_dictionary["confidence"] = float(element["rank"]["confidence"])
          except (KeyError, TypeError, ValueError):
              locations_dictionary["confidence"] = 0
          locations_dictionary["country"] = element.get("country")
          locations_dict_list.append(locations_dictionary)
  return pd.DataFrame.from_dict(locations_dict_list)

locations_dataframe = find_places(geocoding_queue)

The job is succeeded. Here are the results:
You can also get results by the URL - https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=424c140b40554a7eb9998e0a7fd53d1c&id=d5a9ae0e4c1e4f1fb05fd3a40b5878c7
The job is succeeded. Here are the results:
You can also get results by the URL - https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=424c140b40554a7eb9998e0a7fd53d1c&id=6740aa1898d34b02a8099224639ba704
The job is succeeded. Here are the results:
You can also get results by the URL - https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=424c140b40554a7eb9998e0a7fd53d1c&id=cc05aa3b42634c55bcba21f0d9a158c9


From here, I can visualise the data and check for possible typos or other reasons that would prevent high-confidence matches.

Locations that were not found at all (or with zero confidence) are most likely typos.
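
A minimal sketch of how such candidates could be listed for manual inspection, using plain dictionaries instead of the dataframe. The function name `flag_suspects`, the `0.5` threshold and the confidence values are assumptions made for illustration.

```python
def flag_suspects(rows, threshold=0.5):
    """Return the queries whose confidence is below the threshold, lowest first."""
    suspects = [row for row in rows if row["confidence"] < threshold]
    suspects.sort(key=lambda row: row["confidence"])
    return [row["query"] for row in suspects]


# Hypothetical geocoding rows with made-up confidence values
rows = [
    {"query": "krikcaldy", "confidence": 0.0},
    {"query": "kirkcaldy", "confidence": 0.95},
    {"query": "norfold, northern", "confidence": 0.3},
]

print(flag_suspects(rows))
```

On the dataframe itself, the same selection would be a boolean filter on the `confidence` column followed by a sort.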

In [None]:
# These are typos found by looking over the data

typos = {'birminghman, central': 'birmingham',
 'birminghman, edgbaston': 'birmingham',
 'clasgow, gorbals': 'glasgow',
 'durgavan': 'dungarvan',
 'edingburghshire': 'edinburghshire',
 'endinburghshire': 'edinburghshire',
 'glamoranshire, ogmore': 'glamorgan',
 'gloamorganshire, southern': 'glamorgan',
 'glomorganshire, rhondda': 'glamorgan',
 'krikcaldy': 'kirkcaldy',
 'linconlnshire, northern': 'lincolnshire',
 'liverppol, west toxteth': 'liverpool',
 'norfold, northern': 'norfolk',
 'prtsmouth, central': 'portsmouth',
 'st george, hanover square': 'london',
 'st pancras, south': 'london',
 'stafforshire, northern': 'staffordshire',
 'thrisk': 'thirsk'}

elections_replaced["to_match"] = elections_replaced["to_match"].replace(typos)

# Check if the replacement brought along new matches
map_county_replace()

# Re-generating the queue with new matches
geocoding_queue = generate_geo_queue()

16850 successful replacements


With the higher-quality list of places, I can rerun the geocoding.

In [None]:
locations_dataframe = find_places(geocoding_queue)

The job is succeeded. Here are the results:
You can also get results by the URL - https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=424c140b40554a7eb9998e0a7fd53d1c&id=a4fdb99bb0e74d9a877980ac15da8a48
The job is succeeded. Here are the results:
You can also get results by the URL - https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=424c140b40554a7eb9998e0a7fd53d1c&id=603a146d85604d4e9ed35bdb9c6edbd5
The job is succeeded. Here are the results:
You can also get results by the URL - https://api.geoapify.com/v1/batch/geocode/search?bias=countrycode:gb,ie&apiKey=424c140b40554a7eb9998e0a7fd53d1c&id=d82821f162e543f9af7b21d9365a5216


### Removing Eire places - part 2

Thanks to the geocoding response, I can filter out places that are in Eire rather than in the UK.

In [None]:
irish_geolocated = list(locations_dataframe[locations_dataframe["country"] == "Ireland"]["query"].unique())

irish_geolocated += [
    "athlone",
    "bandon",
    "carow",
    "clonmel",
    "dundalk",
    "kinsale",
    "mallow",
    "portarlington",
    "tralee",
    "dublin (st. james's)"
]

elections_replaced = elections_replaced[~elections_replaced["to_match"].isin(irish_geolocated)]
locations_dataframe = locations_dataframe[~locations_dataframe["query"].isin(irish_geolocated)]

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()


In [None]:
for county in sorted(press_map_counties):
  print(county)

aberdeenshire
anglesey
angus
antrim
argyllshire
armagh
ayrshire
banffshire
bedfordshire
berkshire
berwickshire
brecknockshire
buckinghamshire
buteshire
caernarfonshire
caithness
cambridgeshire
cardiganshire
carmarthenshire
cheshire
clackmannanshire
cornwall
cumberland
denbighshire
derbyshire
devon
dorset
down
dumfriesshire
dunbartonshire
durham
east-lothian
essex
fermanagh
fife
flintshire
glamorgan
gloucestershire
hampshire
herefordshire
hertfordshire
huntingdonshire
inverness-shire
kent
kincardineshire
kinross-shire
kirkcudbrightshire
lanarkshire
lancashire
leicestershire
lincolnshire
londonderry
merionethshire
middlesex
midlothian
monmouthshire
montgomeryshire
morayshire
nairnshire
norfolk
northamptonshire
northumberland
nottinghamshire
orkney
oxfordshire
peeblesshire
pembrokeshire
perthshire
radnorshire
renfrewshire
ross-shire
roxburghshire
rutland
selkirkshire
shetland
shropshire
somerset
staffordshire
stirlingshire
suffolk
surrey
sussex
sutherland
tyrone
warwickshire
west-lothian


### Matching coordinates to areas

Using the shapely library, I check if the coordinates I found belong to one of the historical counties of the map.
To do so, I import the GeoJSON map coming from the Historical County Border Project.
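
To show what the containment test does, here is a minimal pure-Python sketch that uses ray casting as a stand-in for shapely's `polygon.contains(point)`. It handles only simple `Polygon` geometries without holes, and the function names and the two square "counties" are hypothetical. Note that GeoJSON stores coordinates as `[longitude, latitude]`, which is why the point is built in that order.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: is (lon, lat) inside the closed ring of [lon, lat] pairs?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_county(lon, lat, features):
    """Return the NAME of the first feature whose outer ring contains the point."""
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # this sketch ignores MultiPolygon and other types
        if point_in_ring(lon, lat, geometry["coordinates"][0]):
            return feature["properties"]["NAME"]
    return None


# Two hypothetical square "counties" in GeoJSON form
features = [
    {"type": "Feature", "properties": {"NAME": "West"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
    {"type": "Feature", "properties": {"NAME": "East"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
]

print(find_county(0.5, 0.5, features))
print(find_county(1.5, 0.5, features))
print(find_county(5, 5, features))
```

Real county borders include islands and holes, so a geometry library is the better choice for the actual map.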

In [None]:
import json
from shapely.geometry import shape, Point
import pandas as pd

# Split the "lat, lon" strings into two numeric columns
locations_dataframe[['latitude', 'longitude']] = locations_dataframe['coordinates'].str.split(',', expand=True).astype(float)

# Load the historical counties map
# (the name `map` shadows the built-in; it is kept because later cells use it)
with open('input/updated_map.json', 'r') as f:
    map = json.load(f)

# For each county polygon, tag every non-Irish place that falls inside it
for feature in map['features']:
    polygon = shape(feature['geometry'])

    for index, row in locations_dataframe.iterrows():
        if row["query"] not in irish_geolocated:
          # shapely expects (x, y), i.e. (longitude, latitude)
          point = Point(row["longitude"], row["latitude"])

          if polygon.contains(point):
              locations_dataframe.at[index, "map_county"] = feature["properties"]["NAME"]

# Map each geocoded query to the county that contains it
county_mapping = dict(zip(locations_dataframe['query'], locations_dataframe['map_county']))

In [None]:
manual_replacements = {'argyll' : 'argyllshire',
    'beaumaris' : 'anglesey',
    'bewdley' : 'worcestershire',
    'bute' : 'buteshire',
    'cambridge' : 'cambridgeshire',
    'cricklade' : 'wiltshire',
    'eye' : 'northamptonshire',
    'flint' : 'flintshire',
    'great marlow' : 'buckinghamshire',
    'lambeth' : 'middlesex',
    'marylebone' : 'middlesex',
    'montgomery' : 'montgomeryshire',
    'southwark' : 'middlesex',
    'tower hamlets' : 'middlesex',
    'westminster' : 'middlesex',
    'woodstock' : 'oxfordshire',
    'chelsea' : 'middlesex',
    'hackney' : 'middlesex',
    'aston manor' : 'warwickshire',
    'bethnal green, north-east' : 'middlesex',
    'bethnal green, south-west' : 'middlesex',
    'caernarvonshire, arfon' : 'caernarfonshire',
    'caernarvonshire, eifion' : 'caernarfonshire',
    'camberwell, dulwich' : 'middlesex',
    'camberwell, north' : 'middlesex',
    'camberwell, peckham' : 'middlesex',
    'fulham' : 'middlesex',
    'hackney, central' : 'middlesex',
    'hackney, north' : 'middlesex',
    'hackney, south' : 'middlesex',
    'hammersmith' : 'middlesex',
    'hampstead' : 'middlesex',
    'islington, east' : 'middlesex',
    'islington, north' : 'middlesex',
    'islington, south' : 'middlesex',
    'islington, west' : 'middlesex',
    'kesington, north' : 'middlesex',
    'kesington, south' : 'middlesex',
    'lambeth, brixton' : 'middlesex',
    'lambeth, kennington' : 'middlesex',
    'lambeth, north' : 'middlesex',
    'marylebone, east' : 'middlesex',
    'marylebone, west' : 'middlesex',
    'newington, walworth' : 'middlesex',
    'newington, west' : 'middlesex',
    'paddington, north' : 'middlesex',
    'paddington, south' : 'middlesex',
    'shoreditch, haggerston' : 'middlesex',
    'shoreditch, hoxton' : 'middlesex',
    'southwark, bermondsey' : 'middlesex',
    'southwark, rotherhithe' : 'middlesex',
    'southwark, west' : 'middlesex',
    'st pancras, east' : 'middlesex',
    'st pancras, north' : 'middlesex',
    'st pancras, west' : 'middlesex',
    'strand' : 'middlesex',
    'tower hamlets, limehouse' : 'middlesex',
    'tower hamlets, mile end' : 'middlesex',
    'tower hamlets, poplar' : 'middlesex',
    'tower hamlets, st george' : 'middlesex',
    'tower hamlets, stepney' : 'middlesex',
    'tower hamlets, whitechapel' : 'middlesex',
    'battersea, north' : 'middlesex',
    'battersea, south' : 'middlesex',
    'bermondsey, rotherhithe' : 'middlesex',
    'bermondsey, west' : 'middlesex',
    'camberwell, north-west' : 'middlesex',
    'fulham, east' : 'middlesex',
    'fulham, westr' : 'middlesex',
    'hammersmith, north' : 'middlesex',
    'hammersmith, south' : 'middlesex',
    'holborn' : 'middlesex',
    'hornsey' : 'middlesex',
    'kensington, north' : 'middlesex',
    'kensington, south' : 'middlesex',
    'poplar, south poplar' : 'middlesex',
    'rhondda, east' : 'glamorgan',
    'rhondda, west' : 'glamorgan',
    'sheffiled, attercliffe' : 'yorkshire',
    'shoreditch' : 'middlesex',
    'southwark, central' : 'middlesex',
    'southwark, north' : 'middlesex',
    'southwark, south-east' : 'middlesex',
    'st marylebone' : 'middlesex',
    'st pancras, south-east' : 'middlesex',
    'st pancras, south-west' : 'middlesex',
    'stepney, limehouse' : 'middlesex',
    'stepney, mile end' : 'middlesex',
    'stoke newington' : 'middlesex',
    'westminster, abbey' : 'middlesex',
    "westminster, st george's" : 'middlesex'
}



elections_replaced["to_match"] = elections_replaced["to_match"].replace(manual_replacements)

map_county_replace()

18292 successful replacements


In [None]:
failure = elections_replaced[elections_replaced["map_county"] == "not_found"]
failure_list = list(failure["to_match"].unique())

for element in failure_list:
  print(element)

clackmannanshire and kinross-shir
elginshire and nairnshire
orkney and shetland
ross and cromarty
peeblesshire and selkirkshire
peeplesshire and selkirkshire
Inverness-shire and Ross and Cromarty, Western Isles
Stirlingshire and Clackmannanshire, West Stirlingshire
aberdeenshire and kincardineshire
ayrshire and bute, bute and north
ayrshire and bute, kilmarnock
ayrshire and bute, south ayrshire
berwickshire and haddingtonshire
breconshire and radnorshire
caithness and sutherland
galloway
iverness-shire and ross and croma
midlothian and peeblesshire, nort
midlothian and peeblesshire, peeb
moray and mairnshire
perthshire and kinross-shire, kin
perthshire and kinross-shire, per
roxburghshire and selkirkshire
stirlingshire and clackmannanshir
stirlingshrie and clackmannanshir
fermanagh and tyrone


# Final check
Since we created a separate column for modified constituency names (`to_match`) and another one for the matched counties (`map_county`), we can now check every replacement we made to see if there are any obvious mistakes.

In [None]:
control_dataframe = elections_replaced[["cst_n", "map_county"]].drop_duplicates()
control_dataframe

Unnamed: 0,cst_n,map_county
0,aberdeen,aberdeenshire
2,aberdeenshire,aberdeenshire
3,abingdon,berkshire
5,andover,hampshire
9,anglesey,anglesey
...,...,...
19314,antrim,antrim
19318,armagh,armagh
19661,down,down
19757,fermanagh and tyrone,not_found
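
One way to make this check easier to scan is to group the constituency names by the county they were assigned to. This is a minimal sketch on plain tuples: the function name `group_by_county` is hypothetical, and the pairs are copied from the sample rows shown above.

```python
from collections import defaultdict


def group_by_county(pairs):
    """Map each county to the sorted list of constituencies assigned to it."""
    grouped = defaultdict(set)
    for constituency, county in pairs:
        grouped[county].add(constituency)
    return {county: sorted(names) for county, names in sorted(grouped.items())}


# (cst_n, map_county) pairs taken from the output above
pairs = [
    ("aberdeen", "aberdeenshire"),
    ("aberdeenshire", "aberdeenshire"),
    ("abingdon", "berkshire"),
    ("fermanagh and tyrone", "not_found"),
]

for county, names in group_by_county(pairs).items():
    print(county, "<==", names)
```

Reading the result county by county makes an out-of-place constituency stand out more easily than in a list sorted by constituency name.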


In [None]:
map_counties = []
for feature in map['features']:
   map_counties.append(feature["properties"]["NAME"].lower())

# This part finds out the only county without a match, cromartyshire
map_counties = sorted(list(set(map_counties)))
for element in map_counties:
  if element.lower() not in press_map_counties:
    print(element)


for element in press_map_counties:
  if element not in map_counties:
    print(element)

cromartyshire


In [None]:
with open("pickles/elections_cleaned.pkl", "wb") as f:
  pickle.dump(elections_replaced, f)