# Press directories cleaning pipeline

The original dataset, `PressDirectories.csv`, comes from the [Living with Machines project](https://github.com/Living-with-machines/PressDirectories).

The dataset contains a Wikidata ID for each county (sometimes more than one), so these IDs can also be used to match counties to the map of historical counties, although that requires matching on the "other side" as well, i.e. matching the map labels to Wikidata IDs.

## Setting the working directory

If you are running this notebook on Google Colab, you can use the code below to mount your Google Drive, which can then be used as the working directory.


```
from google.colab import drive
drive.mount('/content/drive')
```

The file structure for this and the other notebooks should be the following:
```
├── Main working directory
│   ├── pickles (folder)
│   │   ├── elections_cleaned.pkl (generated by the second cleaning file)
│   │   ├── *cleaned_press_directories.pkl (created by this file)
│   ├── input (folder)
│   │   ├── *PressDirectories.csv (necessary for this file)
│   ├── output (folder)
│   ├── index.html
│   ├── main.js
│   ├── style.css
```



## Loading the Press directories data

The dataset is relatively small, so it can be loaded each time without the need to store it otherwise.

In [None]:
import pandas as pd
press_directories = pd.read_csv("input/PressDirectories.csv")

press_directories = press_directories.reset_index(drop = True)

# This way an original version of the file is stored
# in the press_directories variable
df = press_directories.copy()


## Removing the Eire counties
The map used to plot the data contains only the UK historical counties, so the ones belonging to the current Republic of Ireland have to be filtered out.

To do so, I used a mix of literal and regex replacements, checking that they did not overreach and delete counties outside Eire. Whenever the original dataset is updated, this check should be repeated to make sure that there are no ambiguous matches.

### 1. Regex matching
Many Irish counties (in both present-day Eire and Northern Ireland) are written like this:
> in the province of ulster and county armagh

In this example, the name of the county is simply Armagh (or Armagh county). Before matching such names literally, I used a regex pattern to find all these occurrences and replace them with the name of the county. This affects Irish counties in general, not just those in Eire.

In [None]:
# Define the regular expression pattern

pattern = r'in the province of ([\w\s\']+?) and county ([\w\s\']+)'



# Function to replace the matched phrases
def replace_phrase(match):
    county_name = match.group(2)
    if county_name[:3] == "of ":
        county_name = county_name[3:]
    if 'county' in county_name:
        return county_name
    else:
        return county_name + ' county'

original_column = df["county"]


# str.replace passes each regex match to replace_phrase, which builds
# the replacement string from the captured county name
df["map_county"] = df["county"].str.replace(pattern, replace_phrase, regex=True)

counties = list(df["map_county"].unique())

replaced_column = df["map_county"]

# Dictionary containing all replaced counties
counties_replaced = {}


# Store changes
for original, replaced in zip(original_column, replaced_column):
    if original != replaced:
        counties_replaced[original] = replaced
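
To see what the pattern does and does not catch, the same replacement can be tried on a few strings with the standard-library `re` module. This is only a minimal sketch: `re.sub` stands in for pandas' `str.replace`, and the second sample string is a hypothetical variation that is not taken from the dataset.

```python
import re

pattern = r'in the province of ([\w\s\']+?) and county ([\w\s\']+)'

def replace_phrase(match):
    # Same logic as in the cell above: keep the county name, drop a leading "of "
    county_name = match.group(2)
    if county_name[:3] == "of ":
        county_name = county_name[3:]
    if 'county' in county_name:
        return county_name
    return county_name + ' county'

samples = [
    "in the province of ulster and county armagh",
    "in the province of leinster and county of dublin",    # hypothetical variation
    "in the province of ulster and the county of armagh",  # different word order
    "kent",
]

# Map each sample to its replaced form
results = {s: re.sub(pattern, replace_phrase, s) for s in samples}
for original_name, replaced_name in results.items():
    print(f"{original_name!r} -> {replaced_name!r}")
```

The third sample is left unchanged because the pattern expects `and county`, not `and the county of`, which is why a second pass over the remaining names is needed.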

### 2. Finding counties slipping through

Some counties escape the regex pattern because of how they are named, often just because they use a different word order.
In some cases, writing a regex for them would be risky because it could match unwanted rows, so I used a dictionary instead.

Once again, these replacements are based on my survey of the data at the moment of writing, and might need changing in the future.
The block of code below serves as a guide to see what counties are in the dataset.

In [None]:
def escaping():
    # Print every distinct county name so that leftovers can be spotted by eye.
    # The loop variable is local to the function, so nothing needs deleting.
    for county in df["map_county"].unique():
        print(county)

escaping()

london
sussex
cambridgeshire
kent
cumberland
essex
gloucestershire
derbyshire
durham
devonshire
hertfordshire
warwickshire
yorkshire
leicestershire
norfolk
lancashire
northumberland
staffordshire
lincolnshire
cheshire
hampshire
berkshire
worcestershire
carmarthenshire
aberdeenshire
forfarshire
ayrshire
fifeshire
dumfriesshire
edinburghshire
lanarkshire
inverness-shire
roxburghshire
nainshire
perthshire
antrim county
londonderry county
galway county
limerick county
longford county
tipperary county
monmouthshire
cornwall
carlow county
clare county
kilkenny county
meath county
dorsetshire
clackmannanshire
in the province of leinster and queen 's county
in the province of leinster and king 's county
in the province of leinster and king's county
in the province of connaught and the county of sligo
bedfordshire
somersetshire
suffolk
wiltshire
shropshire
kinross-shire
herefordshire
pembrokeshire
argyllshire
oxfordshire
in the province of ulster and the county of armagh
banffshire
flintshire
p

In [None]:
# These are replacements that are not as easily captured by simple regex, so
# I preferred to spell them out to make it more transparent

escaped_replacements =  {
    # Queen's and King's county (Laois and Offaly, respectively, in other naming conventions)
    "in the province of leinster and queen 's county" : "queen's county",
    "in the province of leinster and king 's county" : "king's county",
    "in the province of leinster and king's county" : "king's county",
    # Sligo
    'in the province of connaught and the county of sligo' : "sligo county",
    # Armagh
    'in the province of ulster and the county of armagh' : "armagh county",
    'in the province of ulster and the county armagh' : "armagh county",
    # Antrim
    'in the province of ulster and co . antrim' : "antrim county",
    'in the prov. of ulster and co. antrim' : 'antrim county',
    # Down
    'in county down , province of ulster' : "down county",
    # Tyrone
    'in the county of tyrone and province of ulster' : 'tyrone county',
    'co . tyrone' : "tyrone county",
    # Clare
    'in the prov . of munster and co . clare' : 'clare county',
    # Wicklow
    'county wicklow and province of leinster' : 'wicklow county',
    # Leitrim
    'prov . of connaught & county leitrim' : 'leitrim county',
    'co . leitrim' : "leitrim county",
    # Cork
    "county cork" : "cork county",
    # This would actually be both, but I am taking it out of the dataset anyway
    # since it's in Eire
    "in the province of leinster , and between counties meath and  louth" : "meath county",
    # Donegal
    "county donegal" : "donegal county",
    # Down
    "county down" : "down county",
    # Fermanagh
    "county fermanagh" : "fermanagh county"
    }


# Replace values using the dictionary
df["map_county"] = df["map_county"].replace(escaped_replacements)

# Variable removed because it's not useful afterwards
del escaped_replacements
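
After the dictionary has been applied, one way to check that no long-form name is left is to search the remaining values for tell-tale fragments. The sketch below is an assumption about how such a check could look, not part of the original pipeline: the sample values and the `leftover_pattern` regex are hypothetical, and in the notebook the list would come from `df["map_county"].unique()`.

```python
import re

# Hypothetical values, as they might look after the replacements
values = [
    "kent",
    "armagh county",
    "in the province of munster and co . kerry",
    "county mayo",
    "londonderry county",
]

# Flag anything that still mentions a province, uses the "co ." abbreviation,
# or starts with the word "county"
leftover_pattern = re.compile(r"\bprov|\bco\s?\.|^county\b")

leftovers = [value for value in values if leftover_pattern.search(value)]
print(leftovers)
```

Any name printed here would need either a new dictionary entry or a closer look at the source row.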

### 3. Removing the Eire counties

In [None]:
# List counties in the current Republic of Ireland
eire_counties = [
    "carlow",
    "cavan",
    "clare",
    "cork",
    "donegal",
    "dublin",
    "galway",
    "kerry",
    "kildare",
    "kilkenny",
    "laois",
    "leitrim",
    "limerick",
    "longford",
    "louth",
    "mayo",
    "meath",
    "monaghan",
    "offaly",
    "roscommon",
    "sligo",
    "tipperary",
    "waterford",
    "westmeath",
    "wexford",
    "wicklow",
]


# These two correspond to Offaly (king's) and Laois (queen's) in a different naming convention
eire_counties += ["king's", "queen's"]


eire_counties_county = [county + " county" for county in eire_counties]

# Filter out all counties in the list
df = df[~df["map_county"].isin(eire_counties_county)]
press_counties = list(df["map_county"].unique())
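
Since `isin` compares whole strings, a spelling that was not normalised earlier simply survives the filter. A possible sanity check, sketched here in plain Python with a shortened county list and hypothetical values, is to look at what was removed and at which Eire names never matched anything.

```python
# Shortened list, for illustration only
eire_counties = ["carlow", "clare", "dublin"]
eire_counties_county = {county + " county" for county in eire_counties}

# Hypothetical values of df["map_county"] before filtering
before = ["kent", "carlow county", "clare county", "antrim county", "county dublin"]

# Exact-match filter, mirroring ~df["map_county"].isin(...)
after = [name for name in before if name not in eire_counties_county]

# Names that were dropped, and Eire names that matched no row at all
removed = sorted(set(before) - set(after))
never_matched = sorted(eire_counties_county - set(before))

print("removed:", removed)
print("never matched:", never_matched)
```

In this toy example `dublin county` never matches because the row is spelled `county dublin`, so the row stays in the data. An Eire name that never matches is not necessarily a problem, since the county may simply be absent from the dataset, but it is worth a look.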

### 4. Removing "county" from names
This applies to Irish counties, whose names in the Press Directories dataset are often long-winded and, after the previous cleaning steps, still carry the word "county". Keeping it made it easier to tell which counties were Irish (Eire or NI) up to this point, but now they can be formatted like the others.

In [None]:
df["map_county"] = df["map_county"].str.replace(" county", '', case=False).str.strip()
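
The call above removes `" county"` wherever it occurs in a name. If only a trailing "county" should be dropped, an anchored regex is one alternative. The comparison below is a sketch in plain Python, and the third name is a hypothetical value made up to show the difference.

```python
import re

names = ["armagh county", "kent", "down county and antrim"]  # last one is hypothetical

# Plain substring removal, as in the cell above
plain = [name.replace(" county", "").strip() for name in names]

# Anchored removal: only a "county" at the very end of the name is dropped
anchored = [re.sub(r"\s+county$", "", name).strip() for name in names]

print(plain)
print(anchored)
```

For the names that remain at this stage of the notebook the two approaches should agree, so this only matters if new, differently shaped names appear in an updated dataset.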


### 5. Removing unnecessary variables

In [None]:
del eire_counties, eire_counties_county, original, original_column, pattern, replaced, replaced_column

## Removing unrepresented counties
For various reasons, some places listed as counties in the Press Directories dataset did not (and still do not) elect representatives to the UK House of Commons. Since the analysis compares UK general election results to press leanings, I removed them, as they would offer no comparison.

In [None]:
df = df[~df["map_county"].isin(["guernsey", "isle of man", "jersey"])]

## Matching counties to historic counties
The final tool uses the UK's historic counties to map the Press Directories and Elections datasets, so I made sure the county names would match.

In [None]:
# These replacements are based on manually matching the name of counties with
# historic ones, so they might be wrong and should be re-inspected upon re-use


# The commented out ones conflict with the map names

manual_replacements =  {
    "london" : "middlesex", # this is somewhat debatable and london should
    # maybe be removed altogether, given how specific its situation was already
    "caithness-shire" : "caithness",
    "carnarvonshire" : "caernarfonshire",
    "devonshire" : "devon",
    "dorsetshire": "dorset",
    "dumbartonshire" : "dunbartonshire",
    "edinburghshire" : "midlothian",
    "elginshire" : "morayshire",
    "elgin" : "morayshire",
    "fifeshire" : "fife",
    "forfarshire" : "angus",
    "glamorganshire" : "glamorgan",
    "greenock" : "renfrewshire",
    "haddingtonshire" : "east-lothian",
    "isle of anglesey" : "anglesey",
    "isle of bute" : "buteshire",
    "isle of wight" : "hampshire",
    "kinross" : "kinross-shire",
    "linlithgowshire" : "west-lothian",
    "nainshire" : "nairnshire",
    "salop" : "shropshire", # there is already a shropshire, the overlap is due
    # to naming changes through time
    "shetland isles" : "shetland",
    "somersetshire" : "somerset"
}



# Replace values using the dictionary
df["map_county"] = df["map_county"].replace(manual_replacements)
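
One way to re-inspect the result is to compare the renamed counties with the labels used by the map and list whatever has no counterpart. The sketch below assumes a set of map labels called `map_labels`. That name and all sample values are hypothetical, and `dict.get` is used to mimic the whole-value matching of `Series.replace`.

```python
# Hypothetical labels taken from the historic counties map
map_labels = {"middlesex", "devon", "angus", "kent"}

# A few entries from the replacement dictionary above
manual_replacements = {
    "london": "middlesex",
    "devonshire": "devon",
    "forfarshire": "angus",
}

# Hypothetical county names before the manual replacements
press_counties = ["london", "devonshire", "forfarshire", "kent", "salop"]

# Apply the replacements, leaving unknown names untouched
mapped = [manual_replacements.get(county, county) for county in press_counties]

# Names that still have no counterpart on the map
unmatched = sorted(set(mapped) - map_labels)
print(unmatched)
```

Here `salop` is reported because the shortened dictionary has no entry for it. An empty list would mean every county can be placed on the map.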



## Saving and pickling
This notebook exports the `cleaned_press_directories` DataFrame, which is saved as a pickle so that the cleaning does not have to be rerun if nothing has changed.


In [None]:
cleaned_press_directories = df.copy()
# Save the cleaned_press_directories DataFrame as a pickle file
cleaned_press_directories.to_pickle("pickles/cleaned_press_directories.pkl")
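
In a later notebook, the pickle can be read back with `pd.read_pickle`. The helper below is a hypothetical sketch of a "load if present, otherwise rebuild" pattern. The name `load_or_build` and the temporary directory are illustrative and not part of this project. Note that it does not detect changes in the input CSV, so the pickle has to be deleted by hand when the source data is updated.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_or_build(pickle_path, build):
    """Return the pickled DataFrame if it exists, otherwise build and save it."""
    pickle_path = Path(pickle_path)
    if pickle_path.exists():
        return pd.read_pickle(pickle_path)
    frame = build()
    pickle_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_pickle(pickle_path)
    return frame

# Demonstration in a temporary directory, with toy data
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pickles" / "cleaned_press_directories.pkl"
    # First call: the pickle does not exist yet, so the frame is built and saved
    first = load_or_build(path, lambda: pd.DataFrame({"map_county": ["kent", "devon"]}))
    # Second call: the pickle exists, so the build function is not used
    second = load_or_build(path, lambda: pd.DataFrame({"map_county": []}))
    same = first.equals(second)

print(same)
```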