# Evaluating rules of thumb for selecting target seats
### _Determinants of success for progressive challengers in the 2024 UK General Election_



## Background

The Liberal Democrats (LDs) use three high-level measures of a constituency's viability as a potential target seat, derived from results of the most recent general election:

1. the position achieved by the LD candidate (e.g. 2nd, 3rd, 4th) - which I will call 'position'
2. the difference between the LD candidate's number of votes won and the winning candidate's number of votes won - which I will call 'raw vote margin'
3. the difference between the LD candidate's percentage share of votes cast and the winning candidate's percentage share of votes cast - which I will call 'percentage margin'.
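To make the three definitions concrete, here is a minimal sketch that computes them for a single constituency. The vote counts and party codes are invented for illustration, and the helper name `target_measures` is hypothetical rather than anything used by the party.

```python
# Toy result for one constituency (invented figures, not real election data).
votes = {"CON": 21000, "LAB": 18500, "LD": 17900, "GRN": 2600}


def target_measures(votes, party):
    """Return (position, raw vote margin, percentage margin) for `party`."""
    ranked = sorted(votes, key=votes.get, reverse=True)  # parties from most to fewest votes
    total = sum(votes.values())
    winner_votes = votes[ranked[0]]
    position = ranked.index(party) + 1          # 1 = winner, 2 = second place, ...
    raw_margin = winner_votes - votes[party]    # votes behind the winner
    pct_margin = 100 * raw_margin / total       # percentage points behind the winner
    return position, raw_margin, pct_margin


position, raw_margin, pct_margin = target_measures(votes, "LD")
print(position, raw_margin, round(pct_margin, 2))
```

In this toy seat the party is third, yet it trails the winner by only about five percentage points, which is the kind of case the rest of this analysis is concerned with.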

In this analysis I seek to determine which of these is the better predictor of whether a non-incumbent candidate will win the constituency at the next election.

Why does the distinction matter? In most cases a party that came second in a constituency will also have a narrow margin to overturn (in both raw and percentage terms). However, close-run elections can result in the party in third or fourth position in the constituency facing only a narrow margin between their result and the winner's. Conversely, a seat can be won so decisively that the party in second place faces a large margin. 

Taken together, these two effects mean that a party can face a small margin from third place in one seat, and a large margin from second place in another.

Take for example two Liberal Democrat results in the 2024 General Election. In the constituency of Exmouth and Exeter East, the Lib Dems came third, but since it was a close race between three parties, they now face a margin of only 3.26% of the vote between their result and the winner's result. Conversely, in Cambridge, the Lib Dems came second, but face a margin of 13.16%. Which is the better seat to target? That is what this analysis seeks to answer.

## Important Limitations

First, it is important to note that many other factors are included in the selection of target seats, such as number of LD councillors, level of fundraising, qualitative sense of the quality of the party's candidate in that seat, and so on. I hope to analyse these other factors in future, once I have compiled the necessary data. They will not be considered in this analysis.

Second, it is important to acknowledge that political parties will very likely have incorporated one or more of these same factors into their determination of which seats to target - which will in turn have affected the results in those seats. As a consequence, a regression result showing that party position explained more variance in outcomes than percentage margin did, for instance, could very well simply indicate that parties concentrated their resources into seats in which they held favourable positions, in preference over seats where they were within close percentage margins of victory.

To control for this effect, I will incorporate data on which seats were understood to have been targeted by three of the parties: Labour, the Liberal Democrats, and the Greens. I am grateful to the tactical voting campaign, StopTheTories.vote, for providing their data on those three parties' target seats.

Since that data does not include targeting by other parties, I will limit my analysis in this instance to those three parties. I will need to consider the impact upon these three parties in turn, since they face more opponents in both Wales and Scotland (where the nationalist parties compete for their votes), and can arguably face an easier contest in seats where Reform splits the right-wing vote with the Conservatives.


## Literature

The President of the Liberal Democrats, Lord Pack, and I have recently authored a proposed strategy for the party, entitled 'What Next for the Liberal Democrats?', and available at this url: https://docs.google.com/document/d/11aVzII74yXZ9GaneBXK-_nIHP_ow72guAiiZiRfNFEY. In that document, we use a mixture of the three metrics, since each of them has currency within the party. To help take the strategy forwards, I now seek to determine which is the most useful of the metrics. 

The academic literature includes analysis of these factors, and broadly supports percentage margin being the most important of the three. However, many studies are drawn either from two-party systems (most notably the US) or from proportional systems, such as those in continental Europe. Examples include Jacobson's _The Politics of Congressional Elections_, 2015, which argues for percentage margin being the most important of the factors, within the two-party system of US politics, and Bartolini and Mair's _Identity, Competition, and Electoral Availability: The Stabilisation of European Electorates 1885–1985_, 1990, which reaches a similar conclusion about the proportional systems on the continent.

The literature for majoritarian multi-party systems, such as those in use in Britain and Canada, appears to indicate that finishing second might be more important, however. Both the seminal 1969 _Political Change in Britain: The Evolution of Electoral Choice_, by Butler and Stokes, and the 2006 volume _Putting Voters in Their Place: Geography and Elections in Great Britain_ by Johnston & Pattie conclude that coming second is more important than obtaining a narrow percentage margin. 

Since the Liberal Democrats use a mixture of the measures, there is clear scope for further analysis in the British context.



## Research Design

In this first analysis, I will consider each unique 'campaign' in the 2024 election - that is, each unique combination of a political party and a constituency. (Future analysis will include earlier elections.) Whether or not the candidate won the constituency in the 2024 general election will be the primary dependent variable of analysis, since winning the seat is the primary consideration of whether or not to target it. Given the circular nature of causality here (parties are likely to target seats they think they can win, so it is possible that we will see outsized effects), I will repeat the analysis for a secondary dependent variable: the resulting percentage vote share of the candidate, which will provide more nuanced indications of the effects of each of the factors under consideration.

As such, we will have two effect sizes to consider for each of the three possible factors: the effect upon the likelihood of winning the seat in 2024 (to be calculated using logistic regression analysis) and the effect upon the percentage vote share achieved in 2024 (to be calculated using Ordinary Least Squares regression).
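The sketch below illustrates the two model types side by side using the `statsmodels` formula interface. The data are randomly generated rather than drawn from any election, and the column names (`second_2019`, `pct_margin_2019`, `won_2024`, `share_2024`) and the coefficients used to simulate them are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic campaigns (random numbers, not election data), only to show the two model types.
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "second_2019": rng.integers(0, 2, n),       # 1 if the party came second last time
    "pct_margin_2019": rng.uniform(0, 40, n),   # percentage points behind the winner
})

# Assumed data-generating process, for the illustration only
linear = 1.0 + 1.0 * toy["second_2019"] - 0.15 * toy["pct_margin_2019"]
toy["won_2024"] = rng.binomial(1, 1 / (1 + np.exp(-linear)))
toy["share_2024"] = (
    30 + 5 * toy["second_2019"] - 0.5 * toy["pct_margin_2019"] + rng.normal(0, 5, n)
)

# Logistic regression for the binary outcome (won / did not win)
logit_fit = smf.logit("won_2024 ~ second_2019 + pct_margin_2019", data=toy).fit(disp=0)

# Ordinary Least Squares for the continuous outcome (percentage vote share)
ols_fit = smf.ols("share_2024 ~ second_2019 + pct_margin_2019", data=toy).fit()

print(logit_fit.params)
print(ols_fit.params)
```

The logistic coefficients are on the log-odds scale while the OLS coefficients are in percentage points of vote share, so the two sets of effect sizes are not directly comparable with one another.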

For convenience, I will refer to these two dependent variables as the 'outcome' of a campaign in this narrative, including in the hypotheses below. When conducting the analysis and interpreting the findings, however, I will separately evaluate the two dependent variables: whether or not the candidate won the constituency and what percentage vote share they achieved. 


### Hypotheses:

- H1. All three selected factors will be statistically discernible predictors of the outcome of a campaign.
    - H0. Null Hypothesis: one or more of the three selected factors will _not_ be a statistically discernible predictor of the outcome of the campaign.
- H2. Difference in 2019 percentage vote share will be a stronger predictor of the campaign outcome than difference in 2019 number of votes.
    - H0. Null Hypothesis: difference in 2019 percentage vote share will _not_ be a stronger predictor of the campaign outcome than difference in 2019 number of votes.
- H3. Whether the campaign achieved second place in the 2019 election will be a stronger predictor of the 2024 campaign outcome than either other factor.
    - H0. Null hypothesis: whether the campaign achieved second place in the 2019 election will _not_ be a stronger predictor of the 2024 campaign outcome than either other factor.

As indicated above, I will evaluate these hypotheses for each of the two dependent variables in question.

### Data Sources:

To test these hypotheses, I will need to bring together three data sources:

1. 2024 general election results by constituency - produced by the House of Commons Library
2. 2024 general election results by candidate - produced by the House of Commons Library
3. Estimates of notional 2019 general election results by constituency - produced by Rallings & Thrasher of Nuffield College, Oxford

(Note also that a party might have stood different candidates in the same constituency in each of the two elections, so ideally I would also take into account data about 2019 candidates, rather than just 2024 candidates. However, the notional 2019 Rallings & Thrasher data only covers results per constituency, not per candidate, so I would not be able to establish a match between 2019 and 2024 candidates.)


### Exclusions:

To ensure valid conclusions can be drawn regarding predictors of the outcomes of non-incumbent campaigns, I will need to remove: 

1. all campaigns by incumbent MPs
2. all campaigns by parties other than Labour, the Liberal Democrats and the Greens (the three parties for which I have target seat data)
3. all campaigns in the Speaker's constituency, Chorley, which is not contested by the major parties
4. all campaigns in constituencies where a Parliamentary by-election was held between the 2019 and 2024 general elections (because in these the 2019 notional election results will not be valid)

Unlike in the previous analysis, I will not need to remove constituencies in Northern Ireland, because this current analysis is not specific to the performance of the Liberal Democrats (who do not run in those constituencies).


### Controls:

Many other factors could credibly contribute to the outcome of a campaign, including characteristics of the constituency and its population; the candidate; the party; and the campaign itself. Examples of those are as follows:

- The constituency & its population: rural vs urban, economic classification, levels of education, levels of employment, whether it is targeted by each party etc
- The candidate: gender, age, whether or not resides in the constituency, etc
- The national party: support in opinion polls, approval ratings for leader, national expenditure etc.
- The local party: number of councillors, results in local elections, number of members, etc.
- The campaign: quality of message, number of volunteers, local expenditure, volume of literature delivered, etc

Over future analyses I would like to establish the explanatory power of a number of these variables. For this initial analysis, I will control only for those variables that are provided in the House of Commons official results datasets, treating them as potentially influential factors on the outcome of each campaign. For future analysis I will source additional data with which to both test and control for other considerations.

From the House of Commons official results data, I will use the following as controls:

- Candidates:
    1. Gender
    2. Former MP: whether or not the candidate has previously been an MP
    3. Party
- Constituency:
    4. Party who won the seat in 2019
    5. Number of credible candidates campaigning (to be determined by whether the candidate lost their deposit by securing 5% or less of the vote)
    6. Whether or not a nationalist (i.e. SNP or Plaid) candidate stood for election without losing their deposit
    7. Whether or not a Reform UK candidate stood for election without losing their deposit
    8. Region of constituency
    9. Type of constituency (borough / county) - noting that Scottish boroughs are coded as 'burgh', so will need to be consolidated with boroughs.

Additionally, from the StopTheTories.vote data, I will use their measures of:

- Target Seats
    10. Whether the seat was a target for Labour
        - ... for the Greens,
        - ... for the Liberal Democrats,
        - ... or for none of the above.

For each of the above, my rationale is as follows:

1. Gender of candidate. Given that only 40% of Parliamentarians are female, it is possible that two otherwise identical candidates might receive differing numbers of votes based on their gender alone.

2. Whether candidate is former MP. Former MPs could be expected to have greater name recognition than other candidates.

3. Party of candidate. The 2024 election was characterised by anti-Conservative sentiment. We could expect Conservative candidates to perform worse than other candidates.

4. Winning party in 2019. As above, the 2024 general election was characterised by anti-Conservative sentiment. Challenger candidates running against incumbent Conservatives could be expected to perform better than otherwise identical candidates running against other incumbents.

5. Number of credible candidates campaigning. The effect of the number of candidates upon the dependent variable could be expected to be limited by the number of candidates who were credible contenders. To determine credibility, we will use the existing measure of whether a candidate lost their deposit (triggered by achieving 5% or less of the vote). 

8. Region of constituency. Many political parties obtain better results in some regions of the UK than others (such as Labour in the North of England), and many target their efforts towards particular regions (such as the Liberal Democrats in the South West and South East of England). The 'Region' variable also includes the country, so covers similar effects between Wales, Scotland, Northern Ireland and the regions of England.

9. Type of constituency. Some political parties traditionally perform better in urban areas (e.g. Labour) and others traditionally perform better in rural areas (e.g. the Conservatives). 

10. Target seats. A party that targeted a seat will have invested more resources into that campaign (such as advertising spending, person-hours of campaigning by activists, visits by high-profile politicians etc), making it significantly more likely to win.
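Controls 5 to 7 are not supplied directly in the results files, so they would need to be derived from the candidate-level data. The following is a minimal sketch of one way to do that, using invented rows. The column names, the party codes (`RUK`, `SNP`, `PC`) and the use of a 0 to 1 vote share are all assumptions, not the format of the House of Commons files.

```python
import pandas as pd

# Toy candidate-level results (invented). Column names and party codes are assumptions.
cands = pd.DataFrame({
    "constituency": ["A", "A", "A", "A", "B", "B", "B"],
    "party": ["LAB", "CON", "RUK", "GRN", "SNP", "LAB", "RUK"],
    "share": [0.42, 0.30, 0.24, 0.04, 0.45, 0.51, 0.04],
})

# A candidate loses their deposit on 5% or less of the vote, so 'credible' means more than 5%
cands["kept_deposit"] = cands["share"] > 0.05

is_nationalist = cands["party"].isin(["SNP", "PC"])
is_reform = cands["party"] == "RUK"
seat = cands["constituency"]

controls = pd.DataFrame({
    # Control 5: number of candidates who kept their deposit
    "n_credible": cands.groupby("constituency")["kept_deposit"].sum(),
    # Control 6: a nationalist candidate stood and kept their deposit
    "nationalist_credible": (cands["kept_deposit"] & is_nationalist).groupby(seat).any(),
    # Control 7: a Reform UK candidate stood and kept their deposit
    "reform_credible": (cands["kept_deposit"] & is_reform).groupby(seat).any(),
})
print(controls)
```

Because these flags depend on candidates from every party, they would need to be calculated before any rows for excluded parties are dropped.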

## Import libraries

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm  # model classes live in statsmodels.api, not the top-level package
import scipy as sc


## Data Wrangling

In [74]:
# Ingest 2024 candidate data and label as main dataset

main = pd.read_csv("data/HoC-GE2024-results-by-candidate.csv")

main.shape # establish number of rows

(4515, 22)

In [75]:
# Remove candidates in the Speaker's constituency, Chorley

main = main[main['Constituency name'] != 'Chorley']

main.shape # check number of rows has reduced by 5.

(4510, 22)

In [43]:
# Remove incumbent candidates

In [47]:
# Remove campaigns run by parties other than Labour, Liberal Democrats and the Greens
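The two exclusion steps above (incumbents, then parties outside the three of interest) could be sketched as boolean filters. This example runs on a small invented stand-in for `main`. The column names `Sitting MP` and `Party abbreviation`, the `Yes`/`No` coding and the party codes are assumptions about the candidate file and should be checked against the actual columns.

```python
import pandas as pd

# Toy stand-in for `main` (invented rows). Column names, the Yes/No coding and
# the party codes are assumptions about the candidate-level file.
toy_main = pd.DataFrame({
    "Constituency name": ["A", "A", "A", "B", "B"],
    "Party abbreviation": ["Lab", "Con", "LD", "Green", "Lab"],
    "Sitting MP": ["No", "Yes", "No", "No", "Yes"],
})

progressive_parties = ["Lab", "LD", "Green"]

is_incumbent = toy_main["Sitting MP"].eq("Yes")
is_progressive = toy_main["Party abbreviation"].isin(progressive_parties)

# Keep non-incumbent campaigns run by one of the three parties
challengers = toy_main[~is_incumbent & is_progressive]
print(challengers.shape)
```

Any control that depends on other parties' candidates (such as the number of credible candidates in a seat) would need to be computed before this filter is applied.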

In [44]:
# Ingest 2024 constituency results

In [45]:
# Ensure party codes match across the two datasets

In [46]:
# Remove the Speaker's constituency, Chorley
# Note that the other exclusions will not be removed at this point, so that they can be used to calculate control variables

In [48]:
# Convert 2024 constituency results into long-form data, grouped by party and constituency ...
# ... such that each row represents a unique campaign and keeps the necessary data for applying the identified controls.
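One way to reshape a wide constituency table into one row per campaign is `DataFrame.melt`. The sketch below uses an invented two-constituency table. The column names (`ons_id`, `region`, and one vote column per party) are assumptions and will differ from the real file.

```python
import pandas as pd

# Toy wide-format results (invented): one row per constituency, one vote column per party.
wide = pd.DataFrame({
    "ons_id": ["E1", "E2"],
    "region": ["South West", "East of England"],
    "Lab": [12000, 20000],
    "LD": [15000, 9000],
    "Green": [3000, 5000],
})

campaigns_long = wide.melt(
    id_vars=["ons_id", "region"],        # constituency-level columns repeated on every row
    value_vars=["Lab", "LD", "Green"],   # party columns to stack
    var_name="party",
    value_name="votes",
)
print(campaigns_long.sort_values(["ons_id", "party"]))
```

Each (constituency, party) pair now appears exactly once, with the constituency-level columns carried along for use as controls.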

In [49]:
# Calculate the outcome variables for each campaign: won_2024 and percentage_2024

In [50]:
# Bring the 2024 constituency result variables into the main dataset, matching on constituency IDs
# Ensure that no new records are created (to prevent excluded campaigns being reintroduced to the data)
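A left join guards against excluded campaigns being reintroduced, because it keeps only the keys already present in the main dataset. The sketch below uses invented frames and assumed column names. The `validate` argument of `DataFrame.merge` raises an error if the join keys are not unique on both sides, which would otherwise duplicate rows.

```python
import pandas as pd

# Campaigns that survived the exclusions (invented rows; column names are assumptions)
campaigns = pd.DataFrame({
    "ons_id": ["E1", "E1", "E2"],
    "party": ["Lab", "LD", "Green"],
})

# Outcome variables for every campaign, including ones excluded above
results_2024 = pd.DataFrame({
    "ons_id": ["E1", "E1", "E2", "E2", "E3"],
    "party": ["Lab", "LD", "Green", "Lab", "LD"],
    "won_2024": [False, True, False, True, True],
    "percentage_2024": [30.1, 41.5, 12.0, 44.2, 50.3],
})

merged = campaigns.merge(
    results_2024,
    on=["ons_id", "party"],
    how="left",               # keep only campaigns already in the main dataset
    validate="one_to_one",    # fail loudly if either side has duplicate keys
)

# The row count must be unchanged by the merge
assert len(merged) == len(campaigns)
print(merged)
```

The same pattern would apply when bringing in the 2019 notional variables later in the notebook.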

In [51]:
# Ingest 2019 notional constituency results

In [52]:
# Ensure party codes match between this and the main dataset

In [53]:
# Convert notional 2019 constituency results into long-form data, grouped by party and constituency ...
# ... such that each row represents a unique campaign and keeps the necessary data for applying the identified controls.

In [54]:
# Calculate the three independent (predictor) variables:

# Position in 2019

# Difference in absolute votes gained between candidate and winning candidate

# Difference in percentage vote share between candidate and winning candidate
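With the notional results in long form, the three predictors can be computed per constituency with grouped operations. This is a sketch on invented data with assumed column names. It also takes the sum of the listed parties' votes as the denominator, whereas the real calculation would use total valid votes cast.

```python
import pandas as pd

# Toy long-form notional 2019 results (invented figures; column names are assumptions)
notional = pd.DataFrame({
    "ons_id": ["E1"] * 3 + ["E2"] * 3,
    "party": ["Con", "LD", "Lab"] * 2,
    "votes_2019": [25000, 22000, 8000, 30000, 6000, 14000],
})

by_seat = notional.groupby("ons_id")["votes_2019"]

# Position in 2019: 1 = winner, 2 = second place, and so on (ties share the better rank)
notional["position_2019"] = by_seat.rank(ascending=False, method="min").astype(int)

# Raw vote margin: votes behind the winning candidate
notional["raw_margin_2019"] = by_seat.transform("max") - notional["votes_2019"]

# Percentage margin: the raw margin as a share of votes cast in the seat.
# Assumption: the toy total is the sum of the listed parties only.
notional["pct_margin_2019"] = 100 * notional["raw_margin_2019"] / by_seat.transform("sum")

print(notional)
```

A second-place indicator for H3 could then be derived as `notional["position_2019"] == 2`.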

In [55]:
# Bring the 2019 notional result variables into the main dataset, matching on constituency IDs
# Ensure that no new records are created (to prevent campaigns excluded in the 2024 data from being reintroduced at this step)

In [None]:
# Double check that the dataset contains only Labour, Lib Dem and Green campaigns, excluding incumbents and excluding Chorley.

## Description & Visualisation of Resulting Dataset

In [56]:
# Number of campaigns

In [57]:
# Number of campaigns per control variable


# Draw out interpretive significance of control variables

In [58]:
# Show plot of 2019 position vs 2024 position, per campaign


# Describe visualised relationship (association, correlation, outliers)


In [59]:
# Show plot of absolute vote difference vs 2024 position, per campaign


# Describe visualised relationship (association, correlation, outliers)

In [60]:
# Show plot of difference in vote share vs 2024 position, per campaign


# Describe visualised relationship (association, correlation, outliers)

## Statistical Analysis of Data

In [61]:
# Run logistic multiple regression upon dependent variable without controls.

In [62]:
# Repeat without outliers

In [63]:
# Re-run with controls
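As a sketch of how categorical controls could enter the model, the `statsmodels` formula interface accepts `C(...)` to dummy-code a variable such as region. The data below are simulated, and the variable names and effect sizes are assumptions chosen only to make the example run. It does not represent the final model specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated campaigns (random numbers, not election data)
rng = np.random.default_rng(1)
n = 800
sim = pd.DataFrame({
    "pct_margin_2019": rng.uniform(0, 40, n),
    "target_seat": rng.integers(0, 2, n),                    # 1 if the party targeted the seat
    "region": rng.choice(["North", "South", "Wales"], n),    # placeholder region labels
})

# Assumed data-generating process, for the illustration only
region_effect = sim["region"].map({"North": 0.5, "South": 0.0, "Wales": -0.5})
linear = 0.5 + 1.5 * sim["target_seat"] - 0.1 * sim["pct_margin_2019"] + region_effect
sim["won_2024"] = rng.binomial(1, 1 / (1 + np.exp(-linear)))

# Model without controls, then with a binary control and a dummy-coded categorical control
without_controls = smf.logit("won_2024 ~ pct_margin_2019", data=sim).fit(disp=0)
with_controls = smf.logit(
    "won_2024 ~ pct_margin_2019 + target_seat + C(region)", data=sim
).fit(disp=0)

print(without_controls.params["pct_margin_2019"])
print(with_controls.params)
```

Comparing the coefficient on the predictor of interest between the two fits gives a first indication of how much of its apparent effect is absorbed by the controls.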

In [64]:
# Repeat without outliers

## Support for Hypotheses

## Conclusions