# Evaluating rules of thumb for selecting target seats
### _Determinants of success for progressive challengers in the 2024 UK General Election_



## Background

The Liberal Democrats (LDs) use three high-level measures of a constituency's viability as a potential target seat, derived from the results of the most recent general election:

1. the position achieved by the LD candidate (e.g. 2nd, 3rd, 4th) - which I will call 'position'
2. the difference between the LD candidate's number of votes won and the winning candidate's number of votes won - which I will call 'raw vote margin'
3. the difference between the LD candidate's percentage share of votes cast and the winning candidate's percentage share of votes cast - which I will call 'percentage margin'.

In this analysis I seek to determine which of these three is the best predictor of whether a non-incumbent candidate will win the constituency at the next election.

Why does the distinction matter? In most cases a party that came second in a constituency will also have a narrow margin to overturn (in both raw and percentage terms). However, close-run elections can result in the party in third or fourth position in the constituency facing only a narrow margin between their result and the winner's. Conversely, a seat can be won so decisively that the party in second place faces a large margin. 

Take the two together and one can have a situation in which one party can face a small margin from third place in one seat, and a large margin from second place in another. 

Take for example two Liberal Democrat results in the 2024 General Election. In the constituency of Exmouth and Exeter East, the Lib Dems came third, but since it was a close race between three parties, they now face a margin of only 3.26% of the vote between their result and the winner's result. Conversely, in Cambridge, the Lib Dems came second, but face a margin of 13.16%. Which is the better seat to target? That is what this analysis seeks to answer.
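
To make the three measures concrete, the sketch below computes them for two constituencies. The vote counts are hypothetical (they are not the Exmouth and Exeter East or Cambridge results), and `target_measures` is an illustrative helper rather than part of the analysis code.

```python
# Hypothetical vote counts, for illustration only (not real results)
close_race = {"Con": 17_000, "Lab": 16_500, "LD": 16_000, "Green": 2_500}
safe_seat = {"Lab": 25_000, "LD": 18_000, "Con": 6_000, "Green": 3_000}


def target_measures(votes, party):
    """Return (position, raw vote margin, percentage margin) for `party`."""
    total = sum(votes.values())
    ranked = sorted(votes, key=votes.get, reverse=True)  # highest vote count first
    winner = ranked[0]
    position = ranked.index(party) + 1         # 1 = first, 2 = second, ...
    raw_margin = votes[winner] - votes[party]  # votes behind the winner
    pct_margin = 100 * raw_margin / total      # percentage points behind the winner
    return position, raw_margin, pct_margin


for name, seat in [("close race", close_race), ("safe seat", safe_seat)]:
    position, raw_margin, pct_margin = target_measures(seat, "LD")
    print(f"{name}: position {position}, raw margin {raw_margin}, margin {pct_margin:.2f}%")
```

In this toy case the party is third in the close race but much nearer to the winner than it is from second place in the safe seat, which is the tension the analysis sets out to resolve.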

## Important Limitations

First, it is important to note that many other factors are included in the selection of target seats, such as number of LD councillors, level of fundraising, qualitative sense of the quality of the party's candidate in that seat, and so on. I hope to analyse these other factors in future, once I have compiled the necessary data. They will not be considered in this analysis.

Second, it is important to acknowledge that political parties will very likely have incorporated one or more of these same factors into their determination of which seats to target - which will in turn have affected the results in those seats. As a consequence, a regression result showing that party position explained more variance in outcomes than percentage margin did, for instance, could very well simply indicate that parties concentrated their resources into seats in which they held favourable positions, in preference over seats where they were within close percentage margins of victory.

To control for this effect, I will incorporate data on which seats were understood to have been targeted by three of the parties: Labour, the Liberal Democrats, and the Greens. I am grateful to the tactical voting campaign, StopTheTories.vote, for providing their data on those three parties' target seats.

Since that data does not include targeting by other parties, I will limit my analysis in this instance to those three parties. I will also need to consider the differing contexts in which these three parties compete: they face more opponents in both Wales and Scotland (where the nationalist parties compete for their votes), and can arguably face an easier contest in seats where Reform splits the right-wing vote with the Conservatives.


## Literature

The President of the Liberal Democrats, Lord Pack, and I have recently authored a proposed strategy for the party, entitled 'What Next for the Liberal Democrats?', available at this URL: https://docs.google.com/document/d/11aVzII74yXZ9GaneBXK-_nIHP_ow72guAiiZiRfNFEY. In that document, we use a mixture of the three metrics, since each of them has currency within the party. To help take the strategy forwards, I now seek to determine which of the metrics is the most useful.

The academic literature includes analysis of these factors, and broadly supports percentage margin being the most important of the three. However, many studies are drawn either from two-party systems (most notably the US) or from proportional systems, such as those in continental Europe. Examples include Jacobson's _The Politics of Congressional Elections_, 2015, which argues that percentage margin is the most important of the factors within the two-party system of US politics, and Bartolini and Mair's _Identity, Competition, and Electoral Availability: The Stabilisation of European Electorates 1885–1985_, 1990, which reaches a similar conclusion about the proportional systems on the continent.

The literature for majoritarian multi-party systems, such as those in use in Britain and Canada, appears to indicate that finishing second might be more important, however. Both the seminal 1969 _Political Change in Britain: The Evolution of Electoral Choice_, by Butler and Stokes, and the 2006 volume _Putting Voters in Their Place: Geography and Elections in Great Britain_ by Johnston & Pattie conclude that coming second is more important than obtaining a narrow percentage margin. 

Since the Liberal Democrats use a mixture of the measures, there is clear scope for further analysis in the British context.



## Research Design

In this first analysis, I will consider each unique 'campaign' in the 2024 election - that is, each unique combination of a political party and a constituency. (Future analysis will include earlier elections.) Whether or not the candidate won the constituency in the 2024 general election will be the primary dependent variable, since the prospect of winning a seat is the primary consideration in deciding whether or not to target it. Given the circular nature of causality here (parties are likely to target seats they think they can win, so it is possible that we will see outsized effects), I will repeat the analysis for a secondary dependent variable: the resulting percentage vote share of the candidate, which will provide more nuanced indications of the effects of each of the factors under consideration.

As such, we will have two effect sizes to consider for each of the three possible factors: the effect upon the likelihood of winning the seat in 2024 (to be calculated using logistic regression analysis) and the effect upon the percentage vote share achieved in 2024 (to be calculated using Ordinary Least Squares regression.)
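
One possible way of writing down the two models is sketched below. This is an assumed specification for illustration, not necessarily the exact form fitted later: position is coded as a second-place indicator (as in H3), $\mathbf{x}_i$ stands for the vector of control variables for campaign $i$, and the coefficient symbols are my own notation.

$$
\operatorname{logit}\,\Pr(\text{won}_i = 1) = \beta_0 + \beta_1\,\text{second}_i + \beta_2\,\text{raw margin}_i + \beta_3\,\text{pct margin}_i + \boldsymbol{\gamma}^{\top}\mathbf{x}_i
$$

$$
\text{share}_i = \alpha_0 + \alpha_1\,\text{second}_i + \alpha_2\,\text{raw margin}_i + \alpha_3\,\text{pct margin}_i + \boldsymbol{\delta}^{\top}\mathbf{x}_i + \varepsilon_i
$$

The first equation corresponds to the logistic regression on whether the seat was won, and the second to the Ordinary Least Squares regression on percentage vote share.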

For convenience, I will refer to these two dependent variables as the 'outcome' of a campaign in this narrative, including in the hypotheses below. When conducting the analysis and interpreting the findings, however, I will separately evaluate the two dependent variables: whether or not the candidate won the constituency and what percentage vote share they achieved. 


### Hypotheses:

- H1. All three selected factors will be statistically discernible predictors of the outcome of a campaign.
    - H0. Null Hypothesis: one or more of the three selected factors will _not_ be a statistically discernible predictor of the outcome of the campaign.
- H2. Difference in 2019 percentage vote share will be a stronger predictor of the campaign outcome than difference in 2019 number of votes.
    - H0. Null Hypothesis: difference in 2019 percentage vote share will _not_ be a stronger predictor of the campaign outcome than difference in 2019 number of votes.
- H3. Whether the campaign achieved second place in the 2019 election will be a stronger predictor of the 2024 campaign outcome than either other factor.
    - H0. Null hypothesis: whether the campaign achieved second place in the 2019 election will _not_ be a stronger predictor of the 2024 campaign outcome than either other factor.

As indicated above, I will evaluate these hypotheses for each of the two dependent variables in question.

### Data Sources:

To test these hypotheses, I will need to bring together three data sources:

1. 2024 general election results by constituency - produced by the House of Commons library
2. 2024 general election results by candidate - produced by the House of Commons library
3. Estimates of notional 2019 general election results by constituency - produced by Rallings & Thrasher of Nuffield College, Oxford

(Note also that a party might have stood different candidates in the same constituency in each of the two elections, so ideally I would also take into account data about 2019 candidates, rather than just 2024 candidates. However, the notional 2019 Rallings & Thrasher data only covers results per constituency, not per candidate, so I would not be able to establish a match between 2019 and 2024 candidates.)


### Exclusions:

To ensure valid conclusions can be drawn regarding predictors of the outcomes of non-incumbent campaigns, I will need to remove: 

1. all campaigns by incumbent MPs
2. all campaigns by parties other than Labour, the Liberal Democrats and the Greens (the three parties for which I have data on target seats)
3. all campaigns in seats not contested by all three of these parties: to wit, the seats in Northern Ireland
4. all campaigns in the Speaker's constituency, Chorley, which is not contested by the major parties
5. all campaigns in constituencies where a Parliamentary by-election was held between the 2019 and 2024 general elections (because in these the 2019 notional election results will not be valid)

As in the previous analysis, constituencies in Northern Ireland are removed: although this analysis is not specific to the performance of the Liberal Democrats, it is limited to three parties that do not all contest those seats.


### Controls:

Many other factors could credibly contribute to the outcome of a campaign, including characteristics of the constituency and its population; the candidate; the party; and the campaign itself. Examples of those are as follows:

- The constituency & its population: rural vs urban, economic classification, levels of education, levels of employment, whether it is targeted by each party, etc.
- The candidate: gender, age, whether or not resides in the constituency, etc
- The national party: support in opinion polls, approval ratings for leader, national expenditure etc.
- The local party: number of councillors, results in local elections, number of members, etc.
- The campaign: quality of message, number of volunteers, local expenditure, volume of literature delivered, etc

Over future analyses I would like to establish the explanatory power of a number of these variables. For this initial analysis, I will seek to control for them, as potentially influential factors on the outcome of each campaign. In this initial analysis, I will limit myself to controlling for those variables that are provided in the House of Commons official results datasets. For future analysis I will source additional data with which to both test and control for other considerations.

From the House of Commons official results data, I will use the following as controls:

- Candidates:
    1. Gender
    2. Former MP: whether or not the candidate has ever previously been an MP
    3. Party
- Constituency:
    1. Party who won the seat in 2019
    2. Number of credible candidates campaigning (to be determined by whether the candidate lost their deposit by securing 5% or less of the vote)
    3. Region of constituency
    4. Type of constituency (borough / county) - noting that Scottish boroughs are coded as 'burgh', so will need to be consolidated with boroughs.
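
The consolidation of 'burgh' with 'borough' could be done along the following lines. This is a minimal sketch on a toy column: the values mirror the 'Constituency type' labels that appear in the results data, but the series itself is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'Constituency type' column
types = pd.Series(["County", "Borough", "Burgh", "County"], name="Constituency type")

# Map the Scottish label onto 'Borough'; all other values are left unchanged
consolidated = types.replace({"Burgh": "Borough"})

print(consolidated.tolist())
```

After this step the control has only two levels, so Scottish burgh seats are treated the same as boroughs elsewhere.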

Additionally, from the StopTheTories.vote data, I will use their measures of:

- Target Seats
    1. Whether the seat was a target for Labour
        - ... for the Greens,
        - ... for the Liberal Democrats,
        - ... or for none of the above.

For each of the above, my rationale is as follows:

- Candidates
    1. Gender of candidate. Given that only 40% of Parliamentarians are female, it is possible that two otherwise identical candidates might receive differing numbers of votes based on their gender alone.

    2. Whether candidate is former MP. Former MPs could be expected to have greater name recognition than other candidates.

    3. Party of candidate. The 2024 election was characterised by anti-Conservative sentiment. We could expect Conservative candidates to perform worse than other candidates.

- Constituencies

    1. Winning party in 2019. As above, the 2024 general election was characterised by anti-Conservative sentiment. Challenger candidates running against incumbent Conservatives could be expected to perform better than otherwise identical candidates running against other incumbents.

    2. Number of credible candidates campaigning in 2024. The effect of the number of candidates upon the dependent variable could be expected to be limited by the number of candidates who were credible contenders. To determine credibility, we will use the existing measure of whether a candidate lost their deposit (triggered by achieving 5% or less of the vote). 

    3. Region of constituency. Many political parties obtain better results in some regions of the UK than others (such as Labour in the North of England), and many target their efforts towards particular regions (such as the Liberal Democrats in the South West and South East of England). The 'Region' variable also includes the country, so covers similar effects between Wales, Scotland, Northern Ireland and the regions of England.

    4. Type of constituency. Some political parties traditionally perform better in urban areas (e.g. Labour) and others traditionally perform better in rural areas (e.g. the Conservatives). 

- Target Seats

    1. A party that targeted a seat will have invested more resources into that campaign (such as advertising spending, person-hours of campaigning by activists, visits by high-profile politicians etc), making it significantly more likely to win.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as stats
import scipy as sc


## Data Wrangling

### Create main dataframe 
_Using 2024 results by candidate as a foundation, since each row is a unique constituency campaign_

In [2]:
# Ingest 2024 candidate data and label as main dataset

main = pd.read_csv("data/HoC-GE2024-results-by-candidate.csv")

main.shape # establish number of rows

(4515, 22)

In [3]:
main.columns # determine column names

Index(['ONS ID', 'ONS region ID', 'Constituency name', 'County name',
       'Region name', 'Country name', 'Constituency type', 'Party name',
       'Party abbreviation', 'Electoral Commission party ID', 'MNIS party ID',
       'Electoral Commission adjunct party ID', 'Candidate first name',
       'Candidate surname', 'Candidate gender', 'Sitting MP', 'Former MP',
       'Member MNIS ID', 'Votes', 'Share', 'Change', 'DC Person ID'],
      dtype='object')

In [4]:
# Remove candidates in the Speaker's constituency, Chorley

main = main[main['Constituency name'] != 'Chorley']

In [5]:
# Remove candidates in the Northern Irish constituencies (where not all of Lab, LD and Green compete)

main = main[main['Region name'] != 'Northern Ireland']

In [6]:
# Remove incumbent candidates

main = main[main['Sitting MP'] != 'Yes']

In [7]:
# Identify the abbreviations used for Labour, Liberal Democrats and Greens

main['Party abbreviation'].value_counts()

Party abbreviation
Green    617
LD       615
RUK      608
Lab      459
Ind      444
        ... 
SUN        1
KIRG       1
LCI        1
NHAP       1
NIP        1
Name: count, Length: 86, dtype: int64

In [8]:
# Remove campaigns run by parties other than Labour, Liberal Democrats and the Greens

campaigns_to_keep = ['Green', 'Lab', 'LD']

main = main[main['Party abbreviation'].isin(campaigns_to_keep)]

main # print the top and bottom of the dataframe for visual validation

Unnamed: 0,ONS ID,ONS region ID,Constituency name,County name,Region name,Country name,Constituency type,Party name,Party abbreviation,Electoral Commission party ID,...,Candidate first name,Candidate surname,Candidate gender,Sitting MP,Former MP,Member MNIS ID,Votes,Share,Change,DC Person ID
2,W07000081,W92000004,Aberafan Maesteg,,Wales,Wales,County,Liberal Democrat,LD,PP90,...,Justin,Griffiths,Male,No,No,,916,0.025619,-0.011412,89373
3,W07000081,W92000004,Aberafan Maesteg,,Wales,Wales,County,Green,Green,PP63,...,Nigel,Hill,Male,No,No,,1094,0.030597,0.014817,92890
11,S14000060,S92000003,Aberdeen North,,Scotland,Scotland,Burgh,Scottish Green Party,Green,PP130,...,Esme,Houston,Female,No,No,,1275,0.030289,0.018080,85482
12,S14000060,S92000003,Aberdeen North,,Scotland,Scotland,Burgh,Liberal Democrat,LD,PP90,...,Desmond,Bouse,Male,No,No,,2583,0.061361,-0.015942,74392
15,S14000060,S92000003,Aberdeen North,,Scotland,Scotland,Burgh,Labour,Lab,PP53,...,Lynn,Thomson,Female,No,No,,12773,0.303433,0.183637,21097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4501,E14001604,E12000003,York Central,,Yorkshire and The Humber,England,Borough,Liberal Democrat,LD,PP90,...,Alan,Page,Male,No,No,,3051,0.070424,-0.007796,116606
4503,E14001604,E12000003,York Central,,Yorkshire and The Humber,England,Borough,Green,Green,PP63,...,Lars,Kramm,Male,No,No,,5185,0.119682,0.077808,50749
4510,E14001605,E12000003,York Outer,,Yorkshire and The Humber,England,County,Green,Green,PP63,...,Michael,Kearney,Male,No,No,,2212,0.043283,0.043118,101413
4511,E14001605,E12000003,York Outer,,Yorkshire and The Humber,England,County,Liberal Democrat,LD,PP90,...,Andrew,Hollyer,Male,No,No,,5496,0.107541,-0.079162,34649


In [9]:
# Reset the index so that rows are numbered consecutively (drop=True discards the old index values)

main_cleaned = main.reset_index(drop=True)

main_cleaned

Unnamed: 0,ONS ID,ONS region ID,Constituency name,County name,Region name,Country name,Constituency type,Party name,Party abbreviation,Electoral Commission party ID,...,Candidate first name,Candidate surname,Candidate gender,Sitting MP,Former MP,Member MNIS ID,Votes,Share,Change,DC Person ID
0,W07000081,W92000004,Aberafan Maesteg,,Wales,Wales,County,Liberal Democrat,LD,PP90,...,Justin,Griffiths,Male,No,No,,916,0.025619,-0.011412,89373
1,W07000081,W92000004,Aberafan Maesteg,,Wales,Wales,County,Green,Green,PP63,...,Nigel,Hill,Male,No,No,,1094,0.030597,0.014817,92890
2,S14000060,S92000003,Aberdeen North,,Scotland,Scotland,Burgh,Scottish Green Party,Green,PP130,...,Esme,Houston,Female,No,No,,1275,0.030289,0.018080,85482
3,S14000060,S92000003,Aberdeen North,,Scotland,Scotland,Burgh,Liberal Democrat,LD,PP90,...,Desmond,Bouse,Male,No,No,,2583,0.061361,-0.015942,74392
4,S14000060,S92000003,Aberdeen North,,Scotland,Scotland,Burgh,Labour,Lab,PP53,...,Lynn,Thomson,Female,No,No,,12773,0.303433,0.183637,21097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1686,E14001604,E12000003,York Central,,Yorkshire and The Humber,England,Borough,Liberal Democrat,LD,PP90,...,Alan,Page,Male,No,No,,3051,0.070424,-0.007796,116606
1687,E14001604,E12000003,York Central,,Yorkshire and The Humber,England,Borough,Green,Green,PP63,...,Lars,Kramm,Male,No,No,,5185,0.119682,0.077808,50749
1688,E14001605,E12000003,York Outer,,Yorkshire and The Humber,England,County,Green,Green,PP63,...,Michael,Kearney,Male,No,No,,2212,0.043283,0.043118,101413
1689,E14001605,E12000003,York Outer,,Yorkshire and The Humber,England,County,Liberal Democrat,LD,PP90,...,Andrew,Hollyer,Male,No,No,,5496,0.107541,-0.079162,34649


In [10]:
# List columns to enable removal of unnecessary variables

main_cleaned.columns

Index(['ONS ID', 'ONS region ID', 'Constituency name', 'County name',
       'Region name', 'Country name', 'Constituency type', 'Party name',
       'Party abbreviation', 'Electoral Commission party ID', 'MNIS party ID',
       'Electoral Commission adjunct party ID', 'Candidate first name',
       'Candidate surname', 'Candidate gender', 'Sitting MP', 'Former MP',
       'Member MNIS ID', 'Votes', 'Share', 'Change', 'DC Person ID'],
      dtype='object')

In [11]:
# Remove unnecessary columns

main_cleaned = main_cleaned.drop(['County name', 'Country name', 'MNIS party ID', 'Electoral Commission adjunct party ID', 'Member MNIS ID', 'Change', 'DC Person ID'], axis = 1)

In [12]:
# Create a unique identifier to enable joining of other datasets, using ONS ID and Party abbreviation

match_id = main_cleaned['ONS ID'] + "_" + main_cleaned['Party abbreviation']
main_cleaned.insert(loc=0, column='Match ID', value=match_id)

In [13]:
main_cleaned.head()

Unnamed: 0,Match ID,ONS ID,ONS region ID,Constituency name,Region name,Constituency type,Party name,Party abbreviation,Electoral Commission party ID,Candidate first name,Candidate surname,Candidate gender,Sitting MP,Former MP,Votes,Share
0,W07000081_LD,W07000081,W92000004,Aberafan Maesteg,Wales,County,Liberal Democrat,LD,PP90,Justin,Griffiths,Male,No,No,916,0.025619
1,W07000081_Green,W07000081,W92000004,Aberafan Maesteg,Wales,County,Green,Green,PP63,Nigel,Hill,Male,No,No,1094,0.030597
2,S14000060_Green,S14000060,S92000003,Aberdeen North,Scotland,Burgh,Scottish Green Party,Green,PP130,Esme,Houston,Female,No,No,1275,0.030289
3,S14000060_LD,S14000060,S92000003,Aberdeen North,Scotland,Burgh,Liberal Democrat,LD,PP90,Desmond,Bouse,Male,No,No,2583,0.061361
4,S14000060_Lab,S14000060,S92000003,Aberdeen North,Scotland,Burgh,Labour,Lab,PP53,Lynn,Thomson,Female,No,No,12773,0.303433
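
Joining on 'Match ID' assumes that each party stands at most one candidate per constituency, so that every key is unique. A quick check of that assumption might look like the sketch below, shown here on a small hypothetical table rather than on `main_cleaned` itself.

```python
import pandas as pd

# Toy campaigns table (hypothetical rows), with the key built the same way as above
toy = pd.DataFrame({
    "ONS ID": ["E14001063", "E14001063", "E14001064"],
    "Party abbreviation": ["Green", "LD", "LD"],
})
toy["Match ID"] = toy["ONS ID"] + "_" + toy["Party abbreviation"]

# True only if no key is repeated
keys_are_unique = toy["Match ID"].is_unique

# Any rows sharing a key (keep=False flags every copy, not just the later ones)
duplicates = toy[toy["Match ID"].duplicated(keep=False)]

print(keys_are_unique, len(duplicates))
```

If `duplicates` were non-empty on the real data, those rows would need inspecting before any merge on the key.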


In [14]:
# Sort by Match ID

main_cleaned.sort_values(by='Match ID', inplace=True)
main_cleaned.head()

Unnamed: 0,Match ID,ONS ID,ONS region ID,Constituency name,Region name,Constituency type,Party name,Party abbreviation,Electoral Commission party ID,Candidate first name,Candidate surname,Candidate gender,Sitting MP,Former MP,Votes,Share
12,E14001063_Green,E14001063,E12000008,Aldershot,South East,Borough,Green,Green,PP63,Ed,Neville,Male,No,No,2155,0.044393
13,E14001063_LD,E14001063,E12000008,Aldershot,South East,Borough,Liberal Democrat,LD,PP90,Paul,Harris,Male,No,No,4052,0.083471
14,E14001063_Lab,E14001063,E12000008,Aldershot,South East,Borough,Labour,Lab,PP53,Alex,Baker,Female,No,No,19764,0.407136
15,E14001064_Green,E14001064,E12000005,Aldridge-Brownhills,West Midlands,Borough,Green,Green,PP63,Clare,Nash,Female,No,No,1746,0.042677
16,E14001064_LD,E14001064,E12000005,Aldridge-Brownhills,West Midlands,Borough,Liberal Democrat,LD,PP90,Ian,Garrett,Male,No,No,1755,0.042897


### Clean & standardise the 2024 constituency results

In [15]:
# Ingest 2024 constituency results

r_24 = pd.read_csv("data/HoC-GE2024-results-by-constituency.csv")

r_24.shape

(650, 32)

In [16]:
# Remove the speaker's constituency, Chorley
# Note that the other exclusions will not be removed at this point, so that they can be used to calculate control variables

r_24 = r_24[r_24['Constituency name'] != 'Chorley']

r_24.shape # check that the dataframe now has one fewer row 

(649, 32)

In [17]:
# Remove constituencies in Northern Ireland

r_24 = r_24[r_24['Region name'] != 'Northern Ireland']

r_24.shape # check that the dataframe now has 18 fewer rows 

(631, 32)

In [18]:
# Ensure party codes match across the two datasets

r_24.columns # display list of columns, and visually check use of abbreviations 'Lab', 'LD' and 'Green'

Index(['ONS ID', 'ONS region ID', 'Constituency name', 'County name',
       'Region name', 'Country name', 'Constituency type', 'Declaration time',
       'Member first name', 'Member surname', 'Member gender', 'Result',
       'First party', 'Second party', 'Electorate', 'Valid votes',
       'Invalid votes', 'Majority', 'Con', 'Lab', 'LD', 'RUK', 'Green', 'SNP',
       'PC', 'DUP', 'SF', 'SDLP', 'UUP', 'APNI', 'All other candidates',
       'Of which other winner'],
      dtype='object')

In [19]:
# Bring up list of columns in main_cleaned, to identify redundant columns in r_24_

main_cleaned.columns

Index(['Match ID', 'ONS ID', 'ONS region ID', 'Constituency name',
       'Region name', 'Constituency type', 'Party name', 'Party abbreviation',
       'Electoral Commission party ID', 'Candidate first name',
       'Candidate surname', 'Candidate gender', 'Sitting MP', 'Former MP',
       'Votes', 'Share'],
      dtype='object')

In [20]:
# Remove unnecessary columns, creating new temp dataframe name in case this cell needs to be re-run in debugging.

r_24_ = r_24.drop(['ONS region ID', 'Constituency name', 'County name', 'Region name', 'Country name', 'Declaration time', 'Member first name', 'Member surname', 'Member gender', 'Electorate', 'Invalid votes'], axis = 1)

In [21]:
r_24_.columns

Index(['ONS ID', 'Constituency type', 'Result', 'First party', 'Second party',
       'Valid votes', 'Majority', 'Con', 'Lab', 'LD', 'RUK', 'Green', 'SNP',
       'PC', 'DUP', 'SF', 'SDLP', 'UUP', 'APNI', 'All other candidates',
       'Of which other winner'],
      dtype='object')

In [22]:
r_24_.shape

(631, 21)

In [23]:
# Check the data type in each column

for col in r_24_.columns:
    print(r_24_[col].apply(type).value_counts())
    print("- - -")

ONS ID
<class 'str'>    631
Name: count, dtype: int64
- - -
Constituency type
<class 'str'>    631
Name: count, dtype: int64
- - -
Result
<class 'str'>    631
Name: count, dtype: int64
- - -
First party
<class 'str'>    631
Name: count, dtype: int64
- - -
Second party
<class 'str'>    631
Name: count, dtype: int64
- - -
Valid votes
<class 'int'>    631
Name: count, dtype: int64
- - -
Majority
<class 'int'>    631
Name: count, dtype: int64
- - -
Con
<class 'int'>    631
Name: count, dtype: int64
- - -
Lab
<class 'int'>    631
Name: count, dtype: int64
- - -
LD
<class 'int'>    631
Name: count, dtype: int64
- - -
RUK
<class 'int'>    631
Name: count, dtype: int64
- - -
Green
<class 'int'>    631
Name: count, dtype: int64
- - -
SNP
<class 'int'>    631
Name: count, dtype: int64
- - -
PC
<class 'int'>    631
Name: count, dtype: int64
- - -
DUP
<class 'int'>    631
Name: count, dtype: int64
- - -
SF
<class 'int'>    631
Name: count, dtype: int64
- - -
SDLP
<class 'int'>    631
Name: count, 

In [24]:
# Calculate percentage vote share for each party

parties = ['Con', 'Lab', 'LD', 'RUK', 'Green', 'SNP', 'PC', 'DUP', 'SF', 'SDLP', 'UUP', 'APNI'] # Create a list of parties

for party in parties:
    r_24_[f'{party}%'] = r_24_[f'{party}'] / r_24_['Valid votes']

In [25]:
# Check to see that 12 columns have been added:

r_24_.shape 

(631, 33)

In [26]:
# To create a control variable for how many credible candidates ran in each constituency ...
# ... determine whether each candidate kept their deposit (the cut-off being _more than_ 5% of the vote)

for party in parties:

    r_24_[f'{party}_5%'] = np.where(r_24_[f'{party}%'] > 0.05, 1, 0)

In [27]:
# Check to see that 12 columns have been added:

r_24_.shape

(631, 45)

In [28]:
# Now create a count of candidates who kept their deposit in each constituency

deposit_cols = [party + '_5%' for party in parties] # Create a list of the appropriate columns

r_24_['Credible'] = r_24_[deposit_cols].sum(axis=1)
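
As a sanity check on the threshold logic, the same count can be reproduced on a toy table. The shares below are hypothetical; the point is that the comparison is strict, so a party on exactly 5% is not counted as credible.

```python
import pandas as pd

# Hypothetical vote shares (fractions of valid votes) for two constituencies
shares = pd.DataFrame(
    {
        "Con%": [0.30, 0.05],
        "Lab%": [0.40, 0.45],
        "LD%": [0.25, 0.049],
        "Green%": [0.05, 0.451],
    },
    index=["Seat A", "Seat B"],
)

# Count the parties with strictly more than 5% of the vote in each seat
credible = (shares > 0.05).sum(axis=1)

print(credible.to_dict())
```

Note that in the notebook the count only covers the twelve named party columns, so an independent or minor-party candidate who exceeded 5% (and is folded into 'All other candidates') would not be included in 'Credible'.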


In [29]:
# Convert 2024 constituency results into long-form data, grouped by party and constituency ...
# ... such that each row represents a unique campaign and keeps the necessary data for applying the identified controls.

melt_cols = ['Con', 'Lab', 'LD', 'RUK', 
                'Green', 'SNP', 'PC', 'DUP', 'SF', 'SDLP', 'UUP', 'APNI', 'All other candidates'] # List of columns to melt

id_cols = ['ONS ID', 'Result', 'First party', 'Second party', 'Valid votes', 'Majority', 'Credible'] # list of columns to keep constant

r_24_long = r_24_.melt(id_vars=id_cols, value_vars=melt_cols, var_name='Party abbreviation', value_name='Votes')


In [30]:
print(r_24_long.shape)

r_24_long.columns

(8203, 9)


Index(['ONS ID', 'Result', 'First party', 'Second party', 'Valid votes',
       'Majority', 'Credible', 'Party abbreviation', 'Votes'],
      dtype='object')
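
The row count above is what a wide-to-long reshape should produce: 631 constituencies multiplied by 13 melted columns gives 8,203 rows. The sketch below shows the same reshape on a toy table with made-up constituency IDs and vote counts.

```python
import pandas as pd

# Toy wide-format results: one row per constituency, one column per party
wide = pd.DataFrame({
    "ONS ID": ["X1", "X2"],
    "First party": ["Lab", "LD"],
    "Lab": [500, 200],
    "LD": [300, 400],
    "Green": [100, 50],
})

# Reshape so that each row is one (constituency, party) pair
long = wide.melt(
    id_vars=["ONS ID", "First party"],
    value_vars=["Lab", "LD", "Green"],
    var_name="Party abbreviation",
    value_name="Votes",
)

print(long.shape)
```

Parties that did not stand in a constituency still get a row (with zero votes in the real data), which is why rows such as `E14001063_APNI` appear in the long-form table.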

In [31]:
r_24_long.head()

Unnamed: 0,ONS ID,Result,First party,Second party,Valid votes,Majority,Credible,Party abbreviation,Votes
0,W07000081,Lab hold,Lab,RUK,35755,10354,4,Con,2903
1,S14000060,SNP hold,SNP,Lab,42095,1760,5,Con,5881
2,S14000061,SNP hold,SNP,Lab,46345,3758,5,Con,11300
3,S14000062,SNP gain from Con,SNP,Con,38188,942,5,Con,12513
4,S14000063,Lab gain from SNP,Lab,SNP,36666,7547,3,Con,1696


In [32]:
# Create a unique identifier to enable joining of other datasets, using ONS ID and Party abbreviation

match_id = r_24_long['ONS ID'] + "_" + r_24_long['Party abbreviation']
r_24_long.insert(loc=0, column='Match ID', value=match_id)

In [33]:
r_24_long.head()

Unnamed: 0,Match ID,ONS ID,Result,First party,Second party,Valid votes,Majority,Credible,Party abbreviation,Votes
0,W07000081_Con,W07000081,Lab hold,Lab,RUK,35755,10354,4,Con,2903
1,S14000060_Con,S14000060,SNP hold,SNP,Lab,42095,1760,5,Con,5881
2,S14000061_Con,S14000061,SNP hold,SNP,Lab,46345,3758,5,Con,11300
3,S14000062_Con,S14000062,SNP gain from Con,SNP,Con,38188,942,5,Con,12513
4,S14000063_Con,S14000063,Lab gain from SNP,Lab,SNP,36666,7547,3,Con,1696


In [34]:
r_24_long.sort_values(by='Match ID', inplace=True)

In [35]:
r_24_long.head()

Unnamed: 0,Match ID,ONS ID,Result,First party,Second party,Valid votes,Majority,Credible,Party abbreviation,Votes
6946,E14001063_APNI,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,APNI,0
7577,E14001063_All other candidates,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,All other candidates,282
5,E14001063_Con,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,Con,14081
4422,E14001063_DUP,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,DUP,0
2529,E14001063_Green,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,Green,2155


In [36]:
# Calculate the outcome variable 'won_24' for each campaign

# Create a new column called 'won_24', and set it to contain 1 if the winning party abbreviation matches the party in the row ...
# ... and 0 if it does not

r_24_long['won_24'] = np.where(r_24_long['First party'] == r_24_long['Party abbreviation'], 1, 0)

# Note that we don't need to create the other outcome variable, because it is already provided in the 2024 results by candidate csv file.


In [37]:
r_24_long.head(10)

Unnamed: 0,Match ID,ONS ID,Result,First party,Second party,Valid votes,Majority,Credible,Party abbreviation,Votes,won_24
6946,E14001063_APNI,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,APNI,0,0
7577,E14001063_All other candidates,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,All other candidates,282,0
5,E14001063_Con,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,Con,14081,0
4422,E14001063_DUP,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,DUP,0,0
2529,E14001063_Green,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,Green,2155,0
1267,E14001063_LD,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,LD,4052,0
636,E14001063_Lab,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,Lab,19764,1
3791,E14001063_PC,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,PC,0,0
1898,E14001063_RUK,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,RUK,8210,0
5684,E14001063_SDLP,E14001063,Lab gain from Con,Lab,Con,48544,5683,4,SDLP,0,0


### Incorporate the 2024 constituency results into the main dataframe

In [38]:
# Bring the 2024 result variables into the main dataset, matching on 'Match ID' (i.e. the unique campaign ID)

main_24 = main_cleaned.merge(r_24_long, how="inner", on="Match ID")

In [39]:
# Check that the merge has only added new columns:
# the merged dataframe should contain the same number of rows as main_cleaned,
# and its column count should equal main_cleaned's plus r_24_long's, minus 1 (the shared 'Match ID' key)

print(main_cleaned.shape)
print(r_24_long.shape)
print(main_24.shape)

(1691, 16)
(8203, 11)
(1691, 26)
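
An inner join silently drops any row without a match, which is why the shape comparison above matters. An alternative, sketched below on hypothetical data, is to let pandas enforce the expectations directly: `validate="one_to_one"` raises an error if either side has duplicate keys, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy stand-ins for the campaigns table and the long-form results (made-up values)
left = pd.DataFrame({
    "Match ID": ["X1_LD", "X1_Lab", "X2_LD"],
    "Share": [0.30, 0.50, 0.40],
})
right = pd.DataFrame({
    "Match ID": ["X1_LD", "X1_Lab", "X2_LD", "X2_Con"],
    "won_24": [0, 1, 1, 0],
})

# A left join keeps every campaign, so unmatched rows stay visible instead of vanishing
checked = left.merge(right, how="left", on="Match ID", validate="one_to_one", indicator=True)

# Campaigns that found no partner in the results table
unmatched = checked[checked["_merge"] == "left_only"]

print(len(checked), len(unmatched))
```

Once `unmatched` is confirmed to be empty, the `_merge` column can be dropped and the result is equivalent to the inner join used here.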


In [40]:
# Check that the only repeated columns are 'ONS ID', 'Party abbreviation' and 'Votes'
# Each should appear twice, with the suffix '_x' in the first instance and '_y' in the second

main_24.columns

Index(['Match ID', 'ONS ID_x', 'ONS region ID', 'Constituency name',
       'Region name', 'Constituency type', 'Party name',
       'Party abbreviation_x', 'Electoral Commission party ID',
       'Candidate first name', 'Candidate surname', 'Candidate gender',
       'Sitting MP', 'Former MP', 'Votes_x', 'Share', 'ONS ID_y', 'Result',
       'First party', 'Second party', 'Valid votes', 'Majority', 'Credible',
       'Party abbreviation_y', 'Votes_y', 'won_24'],
      dtype='object')

In [41]:
# Add columns to check that these values match

main_24['Check ONS ID'] = np.where(main_24['ONS ID_x'] == main_24['ONS ID_y'], 1, 0)
main_24['Check Votes'] = np.where(main_24['Votes_x'] == main_24['Votes_y'], 1, 0)
main_24['Check Party Abbreviation'] = np.where(main_24['Party abbreviation_x'] == main_24['Party abbreviation_y'], 1, 0)

In [42]:
# Print % of matching values

print("ONS Match:", main_24['Check ONS ID'].sum() / len(main_24) * 100, "%")

print("Vote Match:", main_24['Check Votes'].sum() / len(main_24) * 100, "%")

print("Party Match:", main_24['Check Party Abbreviation'].sum() / len(main_24) * 100, "%")

ONS Match: 100.0 %
Vote Match: 100.0 %
Party Match: 100.0 %
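
The check-column approach above can be condensed into a small helper. This is only a sketch: the function name `duplicated_columns_agree` and the toy frame are my own inventions for illustration, not part of the analysis.

```python
import pandas as pd

def duplicated_columns_agree(df, base_names, suffixes=('_x', '_y')):
    """Return {base_name: True/False}: does each suffixed pair match on every row?"""
    left_sfx, right_sfx = suffixes
    return {name: bool((df[name + left_sfx] == df[name + right_sfx]).all())
            for name in base_names}

# Toy frame with one matching pair and one deliberately mismatched pair
toy = pd.DataFrame({
    'Votes_x': [100, 200], 'Votes_y': [100, 200],
    'ONS ID_x': ['E1', 'E2'], 'ONS ID_y': ['E1', 'E9'],
})
result = duplicated_columns_agree(toy, ['Votes', 'ONS ID'])
print(result)
```

Note that `==` treats two missing values as unequal, so a pair of columns containing NaN in the same rows would be reported as not matching.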


In [43]:
# Remove check columns and repeated columns

main_24_cleaned = main_24.drop(['ONS ID_y', 'Votes_y', 'Party abbreviation_y', 'Check ONS ID', 'Check Votes', 'Check Party Abbreviation'], axis = 1)

In [44]:
# Rename '_x' columns to exclude that suffix

main_24_cleaned.rename(columns = {'ONS ID_x':'ONS ID', 'Votes_x':'Votes', 'Party abbreviation_x':'Party abbreviation'}, inplace=True)

In [45]:
# Check and confirm final shape and list of columns

print(main_24_cleaned.shape)
print(main_24_cleaned.columns)

(1691, 23)
Index(['Match ID', 'ONS ID', 'ONS region ID', 'Constituency name',
       'Region name', 'Constituency type', 'Party name', 'Party abbreviation',
       'Electoral Commission party ID', 'Candidate first name',
       'Candidate surname', 'Candidate gender', 'Sitting MP', 'Former MP',
       'Votes', 'Share', 'Result', 'First party', 'Second party',
       'Valid votes', 'Majority', 'Credible', 'won_24'],
      dtype='object')


### Clean & standardise the 2019 notional constituency results

In [46]:
# Ingest 2019 notional constituency results

# I know from inspecting the data that some, but not all, of the numeric values contain commas (thousands separators),
# which would otherwise produce NaN values when the columns are converted to numerics.
# With ChatGPT's help, I import every column as a string (dtype=str) so that no values are lost on import,
# then strip the commas and convert the values to numerics further below (also flagged in comments).

r_19 = pd.read_csv('data/estimates-2019-general-election-result-new-constituencies-data-results.csv', dtype=str)

r_19.columns # show columns 

Index(['Boundary Comm name', 'PA name', 'ONS code', 'PANO', 'region_name',
       'country_name', 'constituency_type', 'Index_of_change', '"turnout"',
       'electorate',
       ...
       'SNP%', 'PC%', 'Tot oths%', 'Max oths%', 'blank3', 'APNI%', 'DUP%',
       'SDLP%', 'SF%', 'UUP%'],
      dtype='object', length=109)
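
The import above reads every column as a string and leaves the comma handling for later. A possible alternative, sketched here on an invented two-row CSV rather than the real file, is the `thousands` argument of `pd.read_csv`, which strips the separator while parsing. I have not tried this against the notional results file, so treat it as an assumption that it would behave the same way there.

```python
import io
import pandas as pd

# Invented two-row CSV: one value carries a thousands separator, one does not
csv_text = 'PA name,Conv\nSeat A,"12,345"\nSeat B,678\n'

# thousands=',' tells the parser to treat commas inside numeric fields as separators
toy = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(toy.dtypes)
print(toy)
```

Columns that contain any genuinely non-numeric text would still be read as strings, so a type check like the one later in this section remains worthwhile.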

In [47]:
# Since number of columns viewable is limited by VSCode, print one column at a time instead:

for col in r_19.columns:
    print(col)

Boundary Comm name
PA name
ONS code
PANO
region_name
country_name
constituency_type
Index_of_change
"turnout"
electorate
Conv
Labv
LDv
Grnv
Brxv
SNPv
PCv
TotOthv
MaxOthv
Blank
APNIv
DUPv
SDLPv
SFv
UUPv
Blank2
ADVv
AGSv
ANWPv
AONTv
ASHv
BNPv
BPIv
BSJPv
CFv
CHPv
CMEPv
CMUv
CMUKv
COMLv
CPv
CPAv
CVPv
EDv
GWLDv
HWDIv
INETv
JACv
LBTv
LIBv
LNINv
LUTNv
MKv
MRLPv
MTHRv
NEPv
PATRv
PBPv
PFPv
POSv
PPNVv
RBDv
RENv
SDPv
SEPv
SFPv
SHRPv
SLPv
SNAVv
SOCv
TCRPv
TIGv
TLWv
UGPv
UKIPv
VPPv
WEPv
WRPv
WYv
YESHv
YPPv
YRKSv
IND1v
IND2v
IND3v
IND4v
IND5v
IND6v
blank2
total_votes
majority
maj%
Win19
Second19
CON%
LAB%
LD%
GRN%
BRX%
SNP%
PC%
Tot oths%
Max oths%
blank3
APNI%
DUP%
SDLP%
SF%
UUP%


In [48]:
# Isolate the required columns only

cols_to_keep = ['PA name', 'ONS code', 'region_name', 'Conv', 'Labv', 'LDv', 'Grnv', 'Brxv', 'SNPv', 'PCv', 'DUPv', 'SFv', 'SDLPv', 'UUPv', 'APNIv', 'TotOthv',  
                'total_votes', 'majority', 'Win19', 'Second19']

r_19_slim = r_19[cols_to_keep]

r_19_slim.shape

(650, 20)

In [49]:
# Check the data type in each column

for col in r_19_slim.columns:
    print(r_19_slim[col].apply(type).value_counts())
    print("- - -")

PA name
<class 'str'>    650
Name: count, dtype: int64
- - -
ONS code
<class 'str'>    650
Name: count, dtype: int64
- - -
region_name
<class 'str'>    650
Name: count, dtype: int64
- - -
Conv
<class 'str'>      641
<class 'float'>      9
Name: count, dtype: int64
- - -
Labv
<class 'str'>      632
<class 'float'>     18
Name: count, dtype: int64
- - -
LDv
<class 'str'>      632
<class 'float'>     18
Name: count, dtype: int64
- - -
Grnv
<class 'str'>      636
<class 'float'>     14
Name: count, dtype: int64
- - -
Brxv
<class 'str'>      632
<class 'float'>     18
Name: count, dtype: int64
- - -
SNPv
<class 'str'>      632
<class 'float'>     18
Name: count, dtype: int64
- - -
PCv
<class 'str'>      632
<class 'float'>     18
Name: count, dtype: int64
- - -
DUPv
<class 'str'>    650
Name: count, dtype: int64
- - -
SFv
<class 'str'>      648
<class 'float'>      2
Name: count, dtype: int64
- - -
SDLPv
<class 'str'>      649
<class 'float'>      1
Name: count, dtype: int64
- - -
UUPv
<cla

In [50]:
# Remove the speaker's constituency, Chorley
# Note that the other exclusions will not be removed at this point, so that they can be used to calculate control variables

r_19_slim = r_19_slim[r_19_slim['PA name'] != 'Chorley']

r_19_slim.shape # check that the sheet now has one fewer row 

(649, 20)

In [51]:
# Remove constituencies in Northern Ireland, starting by listing values in the region_name column

r_19_slim['region_name'].unique()



array(['WA', 'SC', 'SE', 'WM', 'NW', 'EM', 'NI', 'GL', 'YH', 'EE', 'SW',
       'NE'], dtype=object)

In [52]:
# Now filter down the sheet to remove those with the value 'NI' in region_name

r_19_slim = r_19_slim[r_19_slim['region_name'] != 'NI']

r_19_slim.shape # check that the dataframe now has 18 fewer rows 

(631, 20)

In [53]:
# Need to standardise and correct the list of parties used as columns in the dataframe

# 1. Rename columns to bring in line with House of Commons library datasets. 
r_19_slim.rename(columns={'Grnv':'Greenv', 'TotOthv':'All Other Candidates'}, inplace=True)

# 2. Update the 'parties' list to reflect that Reform UK campaigned as the Brexit party in 2019:

ind_RUK = parties.index('RUK') # establish index of 'RUK'
parties[ind_RUK] = 'Brx' # replace 'RUK' with 'Brx'

# 3. Create a list with the suffix 'v' on each party name, in line with the dataset column titles:
parties_v = [party + 'v' for party in parties]

# 4. Rename the relevant column titles (because we need to standardise these across datasets)
renaming = dict(zip(parties_v, parties)) # create a dictionary to map old names to new names
r_19_slim.rename(columns=renaming, inplace=True) # use that dictionary to replace the column names in bulk

In [54]:
# Using code from ChatGPT, convert strings into numerics, handling commas, for all columns in the 'parties' list

r_19_slim.loc[:, parties] = r_19_slim.loc[:, parties].map(lambda x: x.replace(',', '') if isinstance(x, str) else x)
r_19_slim.loc[:, parties] = r_19_slim.loc[:, parties].apply(pd.to_numeric, errors='coerce')
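
The same two-step conversion (strip commas, then convert) is needed again further down for other columns, so it could be wrapped in a helper. The sketch below is one way to do that; `strip_commas_to_numeric` is a hypothetical name and the toy values are invented. Assigning whole columns back, rather than writing through `.loc`, lets pandas store them with a numeric dtype.

```python
import pandas as pd

def strip_commas_to_numeric(df, cols):
    """Return a copy of df in which each column in cols has been converted to numbers."""
    out = df.copy()
    for col in cols:
        # Remove thousands separators; missing values are left as missing
        cleaned = out[col].str.replace(',', '', regex=False)
        # Anything that still cannot be parsed becomes NaN
        out[col] = pd.to_numeric(cleaned, errors='coerce')
    return out

toy = pd.DataFrame({'Con': ['1,234', '56', None], 'Lab': ['7', '8,900', '10']})
converted = strip_commas_to_numeric(toy, ['Con', 'Lab'])
print(converted)
print(converted.dtypes)
```

Because `errors='coerce'` silently turns unparseable values into NaN, it is worth counting NaNs before and after the conversion.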

In [55]:
# Re-check the data type in each column

for col in r_19_slim.columns:
    print(r_19_slim[col].apply(type).value_counts())
    print("- - -")

PA name
<class 'str'>    631
Name: count, dtype: int64
- - -
ONS code
<class 'str'>    631
Name: count, dtype: int64
- - -
region_name
<class 'str'>    631
Name: count, dtype: int64
- - -
Con
<class 'int'>    631
Name: count, dtype: int64
- - -
Lab
<class 'int'>    631
Name: count, dtype: int64
- - -
LD
<class 'int'>    631
Name: count, dtype: int64
- - -
Green
<class 'int'>    631
Name: count, dtype: int64
- - -
Brx
<class 'int'>    631
Name: count, dtype: int64
- - -
SNP
<class 'int'>    631
Name: count, dtype: int64
- - -
PC
<class 'int'>    631
Name: count, dtype: int64
- - -
DUP
<class 'int'>    631
Name: count, dtype: int64
- - -
SF
<class 'int'>    631
Name: count, dtype: int64
- - -
SDLP
<class 'int'>    631
Name: count, dtype: int64
- - -
UUP
<class 'int'>    631
Name: count, dtype: int64
- - -
APNI
<class 'int'>    631
Name: count, dtype: int64
- - -
All Other Candidates
<class 'str'>    631
Name: count, dtype: int64
- - -
total_votes
<class 'str'>    631
Name: count, dtype: 

In [56]:
# And check for NAs

for col in r_19_slim.columns:
    print(f'Column: {col}')
    print('NaN count:', r_19_slim[col].isna().sum())
    print('- - -')

Column: PA name
NaN count: 0
- - -
Column: ONS code
NaN count: 0
- - -
Column: region_name
NaN count: 0
- - -
Column: Con
NaN count: 0
- - -
Column: Lab
NaN count: 0
- - -
Column: LD
NaN count: 0
- - -
Column: Green
NaN count: 0
- - -
Column: Brx
NaN count: 0
- - -
Column: SNP
NaN count: 0
- - -
Column: PC
NaN count: 0
- - -
Column: DUP
NaN count: 0
- - -
Column: SF
NaN count: 0
- - -
Column: SDLP
NaN count: 0
- - -
Column: UUP
NaN count: 0
- - -
Column: APNI
NaN count: 0
- - -
Column: All Other Candidates
NaN count: 0
- - -
Column: total_votes
NaN count: 0
- - -
Column: majority
NaN count: 0
- - -
Column: Win19
NaN count: 0
- - -
Column: Second19
NaN count: 0
- - -


In [57]:
# Repeat code from ChatGPT for specific str columns that need to be numeric, beyond those in the 'parties' list

other_numerics = ['All Other Candidates', 'total_votes', 'majority']

r_19_slim.loc[:, other_numerics] = r_19_slim.loc[:, other_numerics].map(lambda x: x.replace(',', '') if isinstance(x, str) else x)
r_19_slim.loc[:, other_numerics] = r_19_slim.loc[:, other_numerics].apply(pd.to_numeric, errors='coerce')

In [58]:
# Check the data type in those columns again

for col in other_numerics:
    print(r_19_slim[col].apply(type).value_counts())
    print("- - -")

All Other Candidates
<class 'int'>    631
Name: count, dtype: int64
- - -
total_votes
<class 'int'>    631
Name: count, dtype: int64
- - -
majority
<class 'int'>    631
Name: count, dtype: int64
- - -


In [59]:
# Add a column to show the number of votes for the winner in each constituency:
r_19_slim['max_votes'] = (r_19_slim[parties].max(axis=1)).astype(float)

In [60]:
# Check the column titles for any errors:

r_19_slim.columns

Index(['PA name', 'ONS code', 'region_name', 'Con', 'Lab', 'LD', 'Green',
       'Brx', 'SNP', 'PC', 'DUP', 'SF', 'SDLP', 'UUP', 'APNI',
       'All Other Candidates', 'total_votes', 'majority', 'Win19', 'Second19',
       'max_votes'],
      dtype='object')

In [61]:
# Convert notional 2019 constituency results into long-form data, grouped by party and constituency ...
# ... such that each row represents a unique campaign and keeps the necessary data for applying the identified controls.

id_cols = ['ONS code', 'Win19', 'Second19', 'total_votes', 'majority', 'max_votes'] # list of columns to keep constant

r_19_long = r_19_slim.melt(id_vars=id_cols, value_vars=parties, var_name='Party abbreviation', value_name='Votes')
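
For readers unfamiliar with `melt`, here is a minimal sketch of the wide-to-long reshaping on an invented two-constituency frame. The identifier columns are repeated on every output row, and each party column becomes its own set of rows, so the long frame has (constituencies × parties) rows. That matches the real data: 631 constituencies × 12 parties = 7,572 rows.

```python
import pandas as pd

# Invented wide-format results: one row per constituency, one column per party
wide = pd.DataFrame({
    'ONS code': ['X1', 'X2'],
    'total_votes': [1000, 2000],
    'Con': [400, 900],
    'LD': [350, 300],
})

# id_vars are kept constant on every output row; value_vars are stacked into two new columns
long = wide.melt(id_vars=['ONS code', 'total_votes'], value_vars=['Con', 'LD'],
                 var_name='Party abbreviation', value_name='Votes')
print(long)
```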

In [62]:
r_19_long.shape

(7572, 8)

In [63]:
# Create a unique identifier to enable joining of other datasets, using ONS ID and Party abbreviation

match_id = r_19_long['ONS code'] + "_" + r_19_long['Party abbreviation']
r_19_long.insert(loc=0, column='Match ID', value=match_id)

In [64]:
# Calculate the independent (predictor) variables:

# 1. Position in 2019, where 1 = most votes

r_19_long['Rank'] = r_19_long.groupby('ONS code')['Votes'].rank(ascending=False, method='min') # method='min' gives tied parties the same position, so parties with 0 votes all share the lowest position
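
The choice of `method` only matters when there are ties. The sketch below, on invented vote counts, compares `method='min'` (used above) with `method='dense'`: after a tie, 'min' skips the positions the tied parties occupy, whereas 'dense' does not.

```python
import pandas as pd

# Invented single-constituency example with a tie for second place and two parties on zero votes
toy = pd.DataFrame({
    'ONS code': ['X1'] * 5,
    'Party abbreviation': ['Con', 'Lab', 'LD', 'PC', 'DUP'],
    'Votes': [100, 50, 50, 0, 0],
})

grouped = toy.groupby('ONS code')['Votes']
toy['rank_min'] = grouped.rank(ascending=False, method='min')      # 1, 2, 2, 4, 4
toy['rank_dense'] = grouped.rank(ascending=False, method='dense')  # 1, 2, 2, 3, 3
print(toy)
```

With 'min', a position of 4 means three parties polled more votes, which is the usual way to read finishing positions.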


In [65]:
# 2. Difference in raw votes gained between candidate and winning candidate

r_19_long['raw_diff'] = r_19_long['max_votes'] - r_19_long['Votes']


In [66]:
# 3. Difference in vote share between candidate and winning candidate, expressed as a proportion of total votes (e.g. 0.05 = 5 percentage points)

r_19_long['pc_diff'] = r_19_long['raw_diff'] / r_19_long['total_votes']
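
As a worked example of the two margin calculations, the sketch below uses the figures for the Conservative row in constituency S14000107, which is displayed further down this notebook: the winner took 20,272 votes, the Conservatives 17,943, and 46,592 votes were cast in total.

```python
max_votes = 20272    # votes for the winning candidate
votes = 17943        # votes for the candidate being assessed
total_votes = 46592  # total votes cast in the constituency

raw_diff = max_votes - votes       # raw vote margin
pc_diff = raw_diff / total_votes   # margin as a proportion of all votes cast

print(raw_diff, round(pc_diff, 6))  # 2329 0.049987
```

Multiplying `pc_diff` by 100 gives the margin in percentage points, here just under 5.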

In [67]:
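# Note: .info without parentheses shows a preview of the dataframe; r_19_long.info() would print the column and dtype summary instead
r_19_long.info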

<bound method DataFrame.info of             Match ID   ONS code Win19 Second19 total_votes majority  \
0      W07000081_Con  W07000081   Lab      Con       44423    13457   
1      S14000060_Con  S14000060   SNP      Con       50127    14210   
2      S14000061_Con  S14000061   SNP      Con       50118     5463   
3      S14000062_Con  S14000062   Con      SNP       45891     2399   
4      S14000058_Con  S14000058   Con      SNP       53345      843   
...              ...        ...   ...      ...         ...      ...   
7567  E14001602_APNI  E14001602   Lab      Con       44759    10396   
7568  E14001603_APNI  E14001603   Con       LD       54128    14638   
7569  W07000112_APNI  W07000112   Con      Lab       36552     1968   
7570  E14001604_APNI  E14001604   Lab      Con       50102    14342   
7571  E14001605_APNI  E14001605   Con      Lab       54750    10782   

      max_votes Party abbreviation  Votes  Rank raw_diff   pc_diff  
0       23509.0                Con  10052   2.

### Incorporate the 2019 notional results into the main dataframe

In [68]:
# Bring the 2019 notional result variables into the main dataset, matching on 'Match ID' (i.e. the unique campaign ID)

main_24_19 = main_24_cleaned.merge(r_19_long, how="inner", on="Match ID")


In [69]:
# The merged dataframe should contain the same number of records as main_24_cleaned,
# and its column count should equal main_24_cleaned's columns plus r_19_long's columns, minus 1 (the shared 'Match ID' key)

print(main_24_cleaned.shape)
print(r_19_long.shape)
print(main_24_19.shape)

(1691, 23)
(7572, 12)
(1676, 34)


In [70]:
# The merged dataframe is missing 15 rows, so re-run the merge as a left join and then look for the unmatched values.

main_24_19 = main_24_cleaned.merge(r_19_long, how="left", on="Match ID")
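
A more direct way to find rows that failed to match in a left join is the `indicator` argument of `merge`, which adds a `_merge` column recording where each row came from. The sketch below uses invented stand-in frames; it is an alternative to the type-based inspection that follows, not the method used in this notebook.

```python
import pandas as pd

# Invented stand-ins: the right-hand frame has no row for 'S2_LD'
left = pd.DataFrame({'Match ID': ['S1_LD', 'S2_LD', 'S3_LD'], 'Votes': [10, 20, 30]})
right = pd.DataFrame({'Match ID': ['S1_LD', 'S3_LD'], 'Rank': [2.0, 1.0]})

# indicator=True adds a '_merge' column with values 'both', 'left_only' or 'right_only'
checked = left.merge(right, how='left', on='Match ID', indicator=True)

# Rows marked 'left_only' had no counterpart in the right-hand frame
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Match ID'].tolist()
print(unmatched)
```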

In [71]:
# Check and count the data types in each column

for col in main_24_19.columns:
    print(main_24_19[col].apply(type).value_counts())
    print("- - -")

Match ID
<class 'str'>    1691
Name: count, dtype: int64
- - -
ONS ID
<class 'str'>    1691
Name: count, dtype: int64
- - -
ONS region ID
<class 'str'>      1688
<class 'float'>       3
Name: count, dtype: int64
- - -
Constituency name
<class 'str'>    1691
Name: count, dtype: int64
- - -
Region name
<class 'str'>    1691
Name: count, dtype: int64
- - -
Constituency type
<class 'str'>    1691
Name: count, dtype: int64
- - -
Party name
<class 'str'>    1691
Name: count, dtype: int64
- - -
Party abbreviation_x
<class 'str'>    1691
Name: count, dtype: int64
- - -
Electoral Commission party ID
<class 'str'>    1691
Name: count, dtype: int64
- - -
Candidate first name
<class 'str'>    1691
Name: count, dtype: int64
- - -
Candidate surname
<class 'str'>    1691
Name: count, dtype: int64
- - -
Candidate gender
<class 'str'>    1691
Name: count, dtype: int64
- - -
Sitting MP
<class 'str'>    1691
Name: count, dtype: int64
- - -
Former MP
<class 'str'>    1691
Name: count, dtype: int64
- - -
V

In [72]:
# The above result shows 15 unexpected float values in 'ONS code'. Let's look at those values, with the components of match_ID:

main_filtered = main_24_19[[isinstance(value, float) for value in main_24_19['ONS code']]] # create filtered dataframe 
main_filtered = main_filtered[['Constituency name', 'Match ID', 'ONS ID',  'Party abbreviation_x', 
                               'ONS code', 'Party abbreviation_y']] # simplify debugging by showing only the relevant columns
main_filtered

Unnamed: 0,Constituency name,Match ID,ONS ID,Party abbreviation_x,ONS code,Party abbreviation_y
1597,"Ayr, Carrick and Cumnock",S14000107_Green,S14000107,Green,,
1598,"Ayr, Carrick and Cumnock",S14000107_LD,S14000107,LD,,
1599,"Ayr, Carrick and Cumnock",S14000107_Lab,S14000107,Lab,,
1600,"Berwickshire, Roxburgh and Selkirk",S14000108_Green,S14000108,Green,,
1601,"Berwickshire, Roxburgh and Selkirk",S14000108_LD,S14000108,LD,,
1602,"Berwickshire, Roxburgh and Selkirk",S14000108_Lab,S14000108,Lab,,
1603,Central Ayrshire,S14000109_Green,S14000109,Green,,
1604,Central Ayrshire,S14000109_LD,S14000109,LD,,
1605,Central Ayrshire,S14000109_Lab,S14000109,Lab,,
1606,Kilmarnock and Loudoun,S14000110_Green,S14000110,Green,,


In [73]:
# We can see from the above that the apparent float values are actually NaNs,
# which the left join inserted wherever a Match ID had no counterpart in r_19_long.
# Manual review of the 2019 notional results CSV shows that these five constituencies have incorrect ONS codes.
# To fix this, I'll create a dictionary mapping the incorrect ONS codes to the corrected ones:

correct_ONS = ['S14000107', 'S14000108', 'S14000109', 'S14000110', 'S14000111']
incorrect_ONS = ['S14000006', 'S14000008', 'S14000010', 'S14000040', 'S14000058']

ONS_corrections = dict(zip(incorrect_ONS, correct_ONS))
ONS_corrections

{'S14000006': 'S14000107',
 'S14000008': 'S14000108',
 'S14000010': 'S14000109',
 'S14000040': 'S14000110',
 'S14000058': 'S14000111'}

In [74]:
# Now replace the incorrect values with the correct values in the r_19_long dataset

r_19_long['ONS code'] = r_19_long['ONS code'].replace(ONS_corrections)


In [75]:
# Let's check that the change has worked:

r_19_long[r_19_long['ONS code'] == 'S14000107']

Unnamed: 0,Match ID,ONS code,Win19,Second19,total_votes,majority,max_votes,Party abbreviation,Votes,Rank,raw_diff,pc_diff
20,S14000006_Con,S14000107,SNP,Con,46592,2329,20272.0,Con,17943,2.0,2329.0,0.049987
651,S14000006_Lab,S14000107,SNP,Con,46592,2329,20272.0,Lab,6219,3.0,14053.0,0.301618
1282,S14000006_LD,S14000107,SNP,Con,46592,2329,20272.0,LD,2158,4.0,18114.0,0.388779
1913,S14000006_Brx,S14000107,SNP,Con,46592,2329,20272.0,Brx,0,5.0,20272.0,0.435096
2544,S14000006_Green,S14000107,SNP,Con,46592,2329,20272.0,Green,0,5.0,20272.0,0.435096
3175,S14000006_SNP,S14000107,SNP,Con,46592,2329,20272.0,SNP,20272,1.0,0.0,0.0
3806,S14000006_PC,S14000107,SNP,Con,46592,2329,20272.0,PC,0,5.0,20272.0,0.435096
4437,S14000006_DUP,S14000107,SNP,Con,46592,2329,20272.0,DUP,0,5.0,20272.0,0.435096
5068,S14000006_SF,S14000107,SNP,Con,46592,2329,20272.0,SF,0,5.0,20272.0,0.435096
5699,S14000006_SDLP,S14000107,SNP,Con,46592,2329,20272.0,SDLP,0,5.0,20272.0,0.435096


In [92]:
# Now let's recreate the match_id using the new ONS code:
r_19_long['Match ID'] = r_19_long['ONS code'] + "_" + r_19_long['Party abbreviation']
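
The fix above has two parts that must go together: correcting the ONS code and rebuilding the Match ID from it. The sketch below bundles them into one function so that the second step cannot be forgotten. The function name `apply_code_corrections` and the toy codes are hypothetical.

```python
import pandas as pd

def apply_code_corrections(df, corrections, code_col='ONS code',
                           party_col='Party abbreviation', key_col='Match ID'):
    """Return a copy of df with corrected codes and a merge key rebuilt from them."""
    out = df.copy()
    out[code_col] = out[code_col].replace(corrections)    # swap incorrect codes for correct ones
    out[key_col] = out[code_col] + '_' + out[party_col]   # rebuild the key from the corrected code
    return out

# Invented example: 'OLD1' is an incorrect code that should read 'NEW1'
toy = pd.DataFrame({
    'Match ID': ['OLD1_LD', 'OK2_LD'],
    'ONS code': ['OLD1', 'OK2'],
    'Party abbreviation': ['LD', 'LD'],
})
fixed = apply_code_corrections(toy, {'OLD1': 'NEW1'})
print(fixed)
```

Working on a copy leaves the original frame untouched, which makes it easier to compare the two if the merge still fails.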

In [93]:
# And let's check to see that we have a row with one of the missing match IDs:

r_19_long[r_19_long['Match ID'] == 'S14000107_Green']

Unnamed: 0,Match ID,ONS code,Win19,Second19,total_votes,majority,max_votes,Party abbreviation,Votes,Rank,raw_diff,pc_diff
2544,S14000107_Green,S14000107,SNP,Con,46592,2329,20272.0,Green,0,5.0,20272.0,0.435096


In [94]:
# Now we'll run the merge using 'inner' again and check that it produces 1691 rows.

main_24_19 = main_24_cleaned.merge(r_19_long, how="inner", on="Match ID")
print(main_24_cleaned.shape)
print(r_19_long.shape)
print(main_24_19.shape)

(1691, 23)
(7572, 12)
(1691, 34)


In [95]:
# Success!

In [96]:
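# Time to check the resulting dataframe
# (.info without parentheses shows a preview of the dataframe; main_24_19.info() would print the column and dtype summary)
main_24_19.info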

<bound method DataFrame.info of              Match ID     ONS ID ONS region ID    Constituency name  \
0     E14001063_Green  E14001063     E12000008            Aldershot   
1        E14001063_LD  E14001063     E12000008            Aldershot   
2       E14001063_Lab  E14001063     E12000008            Aldershot   
3     E14001064_Green  E14001064     E12000005  Aldridge-Brownhills   
4        E14001064_LD  E14001064     E12000005  Aldridge-Brownhills   
...               ...        ...           ...                  ...   
1686     W07000111_LD  W07000111     W92000004              Wrexham   
1687    W07000111_Lab  W07000111     W92000004              Wrexham   
1688  W07000112_Green  W07000112     W92000004             Ynys Môn   
1689     W07000112_LD  W07000112     W92000004             Ynys Môn   
1690    W07000112_Lab  W07000112     W92000004             Ynys Môn   

        Region name Constituency type        Party name Party abbreviation_x  \
0        South East           Borou

In [97]:
# Check columns to see that only 'ONS ID'/'ONS code', 'Votes' and 'Party Abbreviation' are repeated
main_24_19.columns

Index(['Match ID', 'ONS ID', 'ONS region ID', 'Constituency name',
       'Region name', 'Constituency type', 'Party name',
       'Party abbreviation_x', 'Electoral Commission party ID',
       'Candidate first name', 'Candidate surname', 'Candidate gender',
       'Sitting MP', 'Former MP', 'Votes_x', 'Share', 'Result', 'First party',
       'Second party', 'Valid votes', 'Majority', 'Credible', 'won_24',
       'ONS code', 'Win19', 'Second19', 'total_votes', 'majority', 'max_votes',
       'Party abbreviation_y', 'Votes_y', 'Rank', 'raw_diff', 'pc_diff'],
      dtype='object')

In [103]:
# Rename 'Votes_x' as 'Votes_24' and 'Votes_y' as 'Votes_19'

main_24_19.rename(columns={'Votes_x':'Votes_24', 'Votes_y':'Votes_19'}, inplace=True)

In [104]:
# Add columns to check that ONS ID / ONS code and 'Party Abbreviation' match across the datasets

main_24_19['Check ONS ID'] = np.where(main_24_19['ONS ID'] == main_24_19['ONS code'], 1, 0)
main_24_19['Check Party Abbreviation'] = np.where(main_24_19['Party abbreviation_x'] == main_24_19['Party abbreviation_y'], 1, 0)

In [105]:
# Print % of matching values

print("ONS Match:", main_24_19['Check ONS ID'].sum() / len(main_24_19) * 100, "%")

print("Party Match:", main_24_19['Check Party Abbreviation'].sum() / len(main_24_19) * 100, "%")

ONS Match: 100.0 %
Party Match: 100.0 %


In [106]:
# Remove check columns and repeated columns

main_24_19_cleaned = main_24_19.drop(['ONS code', 'Party abbreviation_y', 'Check ONS ID', 'Check Party Abbreviation'], axis = 1)

In [107]:
# Rename '_x' columns to exclude that suffix

main_24_19_cleaned.rename(columns = {'Party abbreviation_x':'Party abbreviation'}, inplace=True)

In [108]:
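# Check final shape of dataframe
# (.info without parentheses prints a preview of the dataframe; main_24_19_cleaned.info() would print the column and dtype summary)
print(main_24_19_cleaned.shape)
print(main_24_19_cleaned.info)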

(1691, 33)
<bound method DataFrame.info of              Match ID     ONS ID ONS region ID    Constituency name  \
0     E14001063_Green  E14001063     E12000008            Aldershot   
1        E14001063_LD  E14001063     E12000008            Aldershot   
2       E14001063_Lab  E14001063     E12000008            Aldershot   
3     E14001064_Green  E14001064     E12000005  Aldridge-Brownhills   
4        E14001064_LD  E14001064     E12000005  Aldridge-Brownhills   
...               ...        ...           ...                  ...   
1686     W07000111_LD  W07000111     W92000004              Wrexham   
1687    W07000111_Lab  W07000111     W92000004              Wrexham   
1688  W07000112_Green  W07000112     W92000004             Ynys Môn   
1689     W07000112_LD  W07000112     W92000004             Ynys Môn   
1690    W07000112_Lab  W07000112     W92000004             Ynys Môn   

        Region name Constituency type        Party name Party abbreviation  \
0        South East       

### Clean and incorporate the StopTheTories.vote target seat data

In [79]:
# Ingest StopTheTories.vote target seat data

In [80]:
# Bring the target seat variables into the main dataset, matching on constituency IDs
# Ensure that no new records are created (to prevent campaigns excluded in the 2024 & 2019 data from being reintroduced at this step)

In [81]:
# Double check that the dataset contains only Labour, Lib Dem and Green campaigns, excluding incumbents and excluding Chorley.

In [82]:
# Calculate the final control variables required: number of credible candidates, presence of nationalists, presence of Reform, 

## Description & Visualisation of Resulting Dataset

In [83]:
# Number of campaigns

In [84]:
# Number of campaigns per control variable


# Draw out interpretive significance of control variables

In [85]:
# Show plot of 2019 position vs 2024 position, per campaign


# Describe visualised relationship (association, correlation, outliers)


In [86]:
# Show plot of absolute vote difference vs 2024 position, per campaign


# Describe visualised relationship (association, correlation, outliers)

In [87]:
# Show plot of difference in vote share vs 2024 position, per campaign


# Describe visualised relationship (association, correlation, outliers)

## Statistical Analysis of Data

In [88]:
# Run multiple logistic regression on the dependent variable, without controls.

In [89]:
# Repeat without outliers

In [90]:
# Re-run with controls

In [91]:
# Repeat without outliers

## Support for Hypotheses

## Conclusions