<a href="https://colab.research.google.com/github/mggg/Training_Materials_25/blob/main/notebooks/practitioners/Thursday/load_clean_run.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load a profile, clean a profile, run an election

In this tutorial notebook, we show how VoteKit can be used to load the cast vote record, clean the ballots, and then run an STV election.

In November 2022, voters in Portland, Oregon approved an overhaul to their system of election.
Previously, Portland had a commission model, with four councillors elected at-large, plus the mayor
as a member of the council. In that system, candidates ran for numbered seats—for instance "City
Commissioner Position 4"—for which the whole city voted. The new system holds the mayor separate and expands to 12 seats, with four geographical districts electing three councillors each through ranked-choice voting.

The precise system of election now in place is called STV or "single transferable vote"; support
from roughly a quarter of the district’s voters is the threshold required for election. Voters can rank
up to six candidates, and rounds of tabulation are conducted with support transferring down the
ballot until three candidates cross the threshold.

The Data and Democracy Lab released a report studying the election. The report looks in particular at the mechanics of the STV election and argues that the voting system played a direct role in securing strong representation for communities of color. You can read the report [here](https://mggg.org/ppm).


## Cast Vote Record (CVR)

The city of Portland released the cast vote record (CVR) for the election in a format that reflected the scantron-style ballot. Each voter was given a row in a table, and each candidate was given 6 columns, one for each ranking position. A vote for a candidate in position i was recorded as a 1 in that candidate's "Ranking i" column, and 0 otherwise. This allows for the possibility of overvotes (multiple candidates have a 1 in their "Ranking i" column) and skips (no candidate has a 1 in their "Ranking i" column).

In order to make this format readable by VoteKit, we need to transform it so that there are only 6 columns total. Each column represents one position of a ranking, and the entry of that column is the candidate ranked in that position.
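
To make the target format concrete, here is a minimal sketch of the same transformation on a made-up ballot with three candidates and two ranking positions. The candidate names, the `toy_cvr` structure, and the helper `to_ranking` are all hypothetical; the labels `""` for a skip and `"overvote"` for an overvote mirror the ones used in the cleaning loop later in this notebook.

```python
# Toy one-hot CVR: each voter maps (candidate, position) -> 1 when marked.
# Candidates and the two-position ballot are made up for illustration.
candidates = ["A", "B", "C"]
num_positions = 2

toy_cvr = [
    {("A", 1): 1, ("B", 2): 1},               # A first, B second
    {("A", 1): 1, ("C", 1): 1, ("B", 2): 1},  # overvote in position 1
    {("C", 2): 1},                            # position 1 skipped
]

def to_ranking(marks):
    """Collapse one voter's one-hot marks into one entry per position."""
    ranking = []
    for position in range(1, num_positions + 1):
        marked = [c for c in candidates if marks.get((c, position), 0) == 1]
        if len(marked) == 0:
            ranking.append("")          # skip
        elif len(marked) > 1:
            ranking.append("overvote")  # which candidates were marked is dropped
        else:
            ranking.append(marked[0])
    return ranking

toy_rankings = [to_ranking(marks) for marks in toy_cvr]
for ranking in toy_rankings:
    print(ranking)
# ['A', 'B']
# ['overvote', 'B']
# ['', 'C']
```

Each voter ends up with exactly one entry per ranking position, which is the shape VoteKit expects.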

In addition to this format, the city also released the data by district, but included every voter from the entire city in each data set. So we will have to scrub the voters that are from the other 3 districts.

### Reformatting the raw data

First, we will read in the csv of the cast vote record, scrub the non-district 1 voters, and create new columns that match the format needed by VoteKit. The csv is too large to be stored in GitHub, so [here is a link.](https://multco.us/info/turnout-and-statistics-november-2024-general-election) You will want the "Councilor District 1 Cast Vote Record Data". Be sure to save it in your working directory.

After we reformat the data, we will use VoteKit to perform the rest of the cleaning.

In [1]:
import pandas as pd

D1_df = pd.read_csv("/content/City_of_Portland__Councilor__District_1_2024_11_29_17_26_12.cvr.csv") #insert the file name that you used when you downloaded the csv
D1_df.head()

Unnamed: 0,RowNumber,BoxID,BoxPosition,BallotID,PrecinctID,BallotStyleID,PrecinctStyleName,ScanComputerName,Status,Remade,...,"Choice_122_1:City of Portland, Councilor, District 1:3:Number of Winners 3:Write-in-122:NON","Choice_122_1:City of Portland, Councilor, District 1:4:Number of Winners 3:Write-in-122:NON","Choice_122_1:City of Portland, Councilor, District 1:5:Number of Winners 3:Write-in-122:NON","Choice_122_1:City of Portland, Councilor, District 1:6:Number of Winners 3:Write-in-122:NON","Choice_50003_1:City of Portland, Councilor, District 1:1:Number of Winners 3:Uncertified Write In:NON","Choice_50003_1:City of Portland, Councilor, District 1:2:Number of Winners 3:Uncertified Write In:NON","Choice_50003_1:City of Portland, Councilor, District 1:3:Number of Winners 3:Uncertified Write In:NON","Choice_50003_1:City of Portland, Councilor, District 1:4:Number of Winners 3:Uncertified Write In:NON","Choice_50003_1:City of Portland, Councilor, District 1:5:Number of Winners 3:Uncertified Write In:NON","Choice_50003_1:City of Portland, Councilor, District 1:6:Number of Winners 3:Uncertified Write In:NON"
0,1,RCV-0001,1,RCV-0001+10003,26,3,4506-1,ScanStation6,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,RCV-0001,2,RCV-0001+10005,32,1,2804-1,ScanStation6,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,RCV-0001,3,RCV-0001+10007,53,1,3303-1,ScanStation6,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,RCV-0001,4,RCV-0001+10009,22,1,4105-1,ScanStation6,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,RCV-0001,5,RCV-0001+10011,53,1,3303-1,ScanStation6,0,0,...,0,0,0,0,0,0,0,0,0,0


Wow, 130 columns is a lot, too many for the dataframe to display. Let's look at them just to get familiar with the data set.

In [2]:
for column in D1_df.columns:
    print(column)

RowNumber
BoxID
BoxPosition
BallotID
PrecinctID
BallotStyleID
PrecinctStyleName
ScanComputerName
Status
Remade
Choice_20_1:City of Portland, Councilor, District 1:1:Number of Winners 3:Peggy Sue Owens:NON
Choice_20_1:City of Portland, Councilor, District 1:2:Number of Winners 3:Peggy Sue Owens:NON
Choice_20_1:City of Portland, Councilor, District 1:3:Number of Winners 3:Peggy Sue Owens:NON
Choice_20_1:City of Portland, Councilor, District 1:4:Number of Winners 3:Peggy Sue Owens:NON
Choice_20_1:City of Portland, Councilor, District 1:5:Number of Winners 3:Peggy Sue Owens:NON
Choice_20_1:City of Portland, Councilor, District 1:6:Number of Winners 3:Peggy Sue Owens:NON
Choice_21_1:City of Portland, Councilor, District 1:1:Number of Winners 3:Timur Ender:NON
Choice_21_1:City of Portland, Councilor, District 1:2:Number of Winners 3:Timur Ender:NON
Choice_21_1:City of Portland, Councilor, District 1:3:Number of Winners 3:Timur Ender:NON
Choice_21_1:City of Portland, Councilor, District 1:4:N

Here we can see that each candidate gets six columns, one for each ranking position. Remember, in order to make this format readable by VoteKit, we need to transform it so that there are only 6 columns total. Each column represents one position of a ranking, and the entry of that column is the candidate ranked in that position.

For now we want to just keep track of the columns that have ranking data.

In [3]:
# stores all columns that have ranking information
rank_columns = {i:[col for col in D1_df.columns if f'{i}:Number' in col] for i in range(1,7)}
all_rank_cols = [col for col_list in rank_columns.values() for col in col_list]
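
The substring test `f'{i}:Number'` works because the ranking position sits immediately before `:Number of Winners` in every column name. As a small check, the sketch below applies the same test to two column names copied from the header printed above; the variable names `cols`, `matches`, and `candidate_name` are only for illustration.

```python
# Two column names copied from the CVR header printed above.
cols = [
    "Choice_21_1:City of Portland, Councilor, District 1:1:Number of Winners 3:Timur Ender:NON",
    "Choice_21_1:City of Portland, Councilor, District 1:2:Number of Winners 3:Timur Ender:NON",
]

# Same test as the dictionary comprehension: is "<i>:Number" a substring?
matches = {i: [c for c in cols if f"{i}:Number" in c] for i in range(1, 7)}

for i, matched in matches.items():
    print(i, len(matched))
# 1 1
# 2 1
# 3 0
# 4 0
# 5 0
# 6 0

# The candidate name is the second-to-last field when splitting on ":".
candidate_name = cols[0].split(":")[-2]
print(candidate_name)
# Timur Ender
```

Note that the `1` in "District 1" does not cause a false match for position 1, because it is followed by `:2:Number`, not `:Number`, in the second column.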

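The code below scrubs any voter who did not cast at least one vote in this contest. Voters from the other districts have no marks in any District 1 ranking column, so this removes them; it also removes any District 1 voter who left the contest entirely blank.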

In [4]:
# keep only rows with at least one mark in a ranking column,
# then renumber the rows from 0 so the index has no gaps
D1_voters_df = D1_df[D1_df[all_rank_cols].sum(axis=1) > 0].reset_index(drop=True)

We now add the new ranking columns that match the VoteKit format.

In the process, we will lose some information about overvotes, when voters put more than one candidate in a ranking.

(Runtime warning: in a local installation this cleaning block takes 30 seconds, but in Colab it can take 4 minutes or more!)



In [6]:
from tqdm.notebook import tqdm

In [5]:
ranking_data = {i:[-1 for _ in range(len(D1_voters_df))] for i in range(1,7)}

# passing total lets tqdm show how far along the loop is
for voter_index, row in tqdm(D1_voters_df.iterrows(), total=len(D1_voters_df)):
    for rank_position in range(1,7):
        num_votes_cast = row[rank_columns[rank_position]].sum()

        if num_votes_cast == 0:
            cast_vote = ""

        elif num_votes_cast > 1:
            cast_vote = "overvote"

            # here we lose knowledge of who was in the overvote. That's how Portland runs their election
            # system, but it could be interesting to study who is in the overvote!

        else:
            # find candidate name from column
            pd_series = row[rank_columns[rank_position]]
            cast_vote_column_name = pd_series.loc[pd_series == 1].index.tolist()[0]
            cast_vote = cast_vote_column_name.split(":")[-2]

        ranking_data[rank_position][voter_index] = cast_vote

# add the new columns
for rank_position in range(1,7):
    D1_voters_df[f"Rank {rank_position}"] = ranking_data[rank_position]

In [7]:
ranking_df = D1_voters_df[[f"Rank {rank_position}" for rank_position in range(1,7)]]
ranking_df.head()

Unnamed: 0,Rank 1,Rank 2,Rank 3,Rank 4,Rank 5,Rank 6
0,Terrence Hayes,Loretta Smith,Noah Ernst,,,
1,Loretta Smith,Steph Routh,Timur Ender,David Linn,Candace Avalos,Jamie Dunphy
2,Loretta Smith,Steph Routh,Timur Ender,David Linn,Candace Avalos,Jamie Dunphy
3,Michael (Mike) Sands,Doug Clove,Joe Furi,Timur Ender,Deian Salazar,Loretta Smith
4,Timur Ender,Candace Avalos,Cayle Tern,Steph Routh,Michael (Mike) Sands,Jamie Dunphy


Now it is in the correct format for VoteKit to read, so we can save it to a csv. Choose a file name that makes sense to you.

In [8]:
ranking_df.to_csv("your_file_name_here.csv")  # the DataFrame index is written as the first column

## Load a profile, clean a profile
Now that the csv is in the correct format for VoteKit, we can complete our cleaning using VoteKit's built-in cleaning tools. Along the way, we will touch on how to load a profile from a file.

CVR stands for cast vote record, so what we are doing below is accessing the part of VoteKit that allows you to load CVRs, and asking for the function that loads csvs. Replace the file name with whatever you used above.

Note that we need to tell VoteKit which columns of the csv contain the ranking data. Open the csv file you saved above to see what columns you need. Don't forget that Python starts counting from 0.
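
If you would rather not open the file by hand, one way to see the column positions is to read the header row and number it. The sketch below uses the standard `csv` module on a stand-in for the saved file (the header and first row from the preview above, with the unnamed index column that pandas writes by default); to inspect your real file, replace the `io.StringIO` object with `open("your_file_name_here.csv", newline="")`.

```python
import csv
import io

# Stand-in for the saved file: header and first row from the preview above.
# pandas writes the DataFrame index as an unnamed first column by default.
sample_csv = io.StringIO(
    ",Rank 1,Rank 2,Rank 3,Rank 4,Rank 5,Rank 6\n"
    "0,Terrence Hayes,Loretta Smith,Noah Ernst,,,\n"
)

# The first row of the file is the header; number its entries from 0.
header = next(csv.reader(sample_csv))
for position, name in enumerate(header):
    print(position, repr(name))
```

The positions printed next to the "Rank" columns are the numbers to type into `rank_cols`.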

In [None]:
from votekit.cvr_loaders import load_csv

rank_cols = [] # type the numbers of the columns you need here, like [5,7,12,14]
raw_profile = load_csv("/content/your_file_name_here.csv", rank_cols=rank_cols)

In [14]:
raw_profile

Profile contains rankings: True
Maximum ranking length: 6
Profile contains scores: False
Candidates: ('Candace Avalos', 'Cayle Tern', 'Jamie Dunphy', 'Loretta Smith', 'Steph Routh', 'Doug Clove', 'Michael (Mike) Sands', 'David Linn', 'Timur Ender', 'Deian Salazar', 'Peggy Sue Owens', 'Joe Allen', 'Joe Furi', 'Terrence Hayes', 'Noah Ernst', 'Thomas Shervey', 'Uncertified Write In', 'Write-in-121', 'Write-in-122', 'Write-in-120', 'overvote')
Candidates who received votes: ('Candace Avalos', 'Cayle Tern', 'Jamie Dunphy', 'Loretta Smith', 'Steph Routh', 'Doug Clove', 'Michael (Mike) Sands', 'David Linn', 'Timur Ender', 'Deian Salazar', 'Peggy Sue Owens', 'Joe Allen', 'Joe Furi', 'Terrence Hayes', 'Noah Ernst', 'Thomas Shervey', 'Uncertified Write In', 'Write-in-121', 'Write-in-122', 'Write-in-120', 'overvote')
Total number of Ballot objects: 19933
Total weight of Ballot objects: 43669

Notice above that there is a difference between the number of total ballots and the total weight. This indicates that the profile has been grouped; that is, ballots with the same ranking have been aggregated so that there is one ballot, but with increased weight. We need to be careful and sum the ballot weights, not the number of ballots, if we want to know the total number of voters.
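
The distinction between "number of ballots" and "total weight" can be illustrated with a toy example. This is only a sketch using `collections.Counter` on made-up rankings, not how VoteKit stores a profile internally.

```python
from collections import Counter

# Six made-up voters, three distinct rankings.
toy_ballots = [
    ("A", "B", "C"),
    ("A", "B", "C"),
    ("B", "A"),
    ("A", "B", "C"),
    ("B", "A"),
    ("C",),
]

# Grouping: one entry per distinct ranking, with a weight counting its voters.
grouped = Counter(toy_ballots)

num_distinct = len(grouped)           # analogous to "number of Ballot objects"
total_weight = sum(grouped.values())  # analogous to "total weight"

print(num_distinct, total_weight)
# 3 6
```

Counting entries gives 3, but the number of voters is the sum of the weights, 6.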

In [17]:
num_ballots_cast = raw_profile.total_ballot_wt

print("There were",num_ballots_cast,"ballots cast.")

There were 43669 ballots cast.


In the rules of Portland's election, which you can find [here](), any skipped positions and overvotes are ignored by the STV algorithm, and any candidates ranked below them are moved up. The same thing happens to three of the write-in categories but, oddly enough, not to the "Uncertified Write In" category.

Portland did not alter the ballots themselves; instead, it told the STV algorithm how to ignore ballot errors. This is mathematically equivalent to pre-processing the ballots. VoteKit's ``remove_and_condense`` function removes candidates and then condenses any ballot positions left empty after scrubbing the given candidates.

In [18]:
from votekit.cleaning import remove_and_condense

cleaned_profile = remove_and_condense(['overvote', 'Write-in-120', 'Write-in-121', 'Write-in-122'], raw_profile)
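
To see what "remove and condense" means for a single ballot, here is a toy version in plain Python. The helper `toy_remove_and_condense` and the example ranking are made up for illustration; this is not VoteKit's implementation, only a sketch of the behavior described above.

```python
def toy_remove_and_condense(ranking, removed):
    """Drop removed entries and blank positions, moving later choices up."""
    return [cand for cand in ranking if cand != "" and cand not in removed]

# A made-up ballot with an overvote, a skip, and a write-in.
ranking = ["overvote", "Loretta Smith", "", "Write-in-120", "Steph Routh", ""]
removed = {"overvote", "Write-in-120", "Write-in-121", "Write-in-122"}

condensed = toy_remove_and_condense(ranking, removed)
print(condensed)
# ['Loretta Smith', 'Steph Routh']
```

The voter's relative order among the remaining candidates is preserved; only the gaps disappear.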

We also have to handle one more item of cleaning. It is entirely possible that a voter listed the same candidate more than once on their ballot, which is not allowed. Portland chose to keep the first occurrence, and ignore any later occurrences, condensing any positions left empty as a result.

In [19]:
from votekit.cleaning import remove_repeated_candidates, condense_profile

cleaned_profile = condense_profile(remove_repeated_candidates(cleaned_profile))
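
The "keep the first occurrence" rule can be sketched for one ballot as follows. The helper `toy_keep_first` and the example ranking are hypothetical, and this is not VoteKit's code; it only illustrates the rule described above.

```python
def toy_keep_first(ranking):
    """Keep each candidate's first occurrence and drop later repeats."""
    seen = set()
    kept = []
    for cand in ranking:
        if cand not in seen:
            seen.add(cand)
            kept.append(cand)
    return kept

# A made-up ballot that lists one candidate twice.
deduped = toy_keep_first(["Noah Ernst", "Terrence Hayes", "Noah Ernst", "Doug Clove"])
print(deduped)
# ['Noah Ernst', 'Terrence Hayes', 'Doug Clove']
```

Because the result is built as a new list, the position vacated by the repeat is closed up automatically, which corresponds to the condensing step.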

Finally, the profile is cleaned and we can save it for analysis. We save it as a thing called a "pickle file," which is a way of storing Python variables. Choose a file name that makes sense to you!

In [34]:
cleaned_profile.to_pickle("thingie.pkl")

print(f"Before cleaning, there were {num_ballots_cast} ballots cast.")
print(f"After cleaning, there are now {cleaned_profile.total_ballot_wt} ballots.")
print("This means that",num_ballots_cast - cleaned_profile.total_ballot_wt,"ballots, or", round(100*float((num_ballots_cast - cleaned_profile.total_ballot_wt)/num_ballots_cast),4),"percent, were scrubbed due to cleaning.")

Before cleaning, there were 43669 ballots cast.
After cleaning, there are now 42871 ballots.
This means that 798 ballots, or 1.8274 percent, were scrubbed due to cleaning.
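
If pickle files are new to you, the round trip below shows the idea with Python's standard `pickle` module on a small made-up object. The dictionary `toy` and the file name `toy.pkl` are only for illustration; the profile's own `to_pickle` and `from_pickle` methods are what this notebook actually uses.

```python
import os
import pickle
import tempfile

# A small made-up Python object to store.
toy = {"ranking": ("A", "B"), "weight": 3}

# Write it to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy.pkl")
with open(path, "wb") as f:
    pickle.dump(toy, f)

# Read it back: the restored object equals the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy)
# True
```

Only load pickle files that you created yourself or that come from a source you trust, since unpickling can execute arbitrary code.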


## Running the election

Finally, we have a CVR that is cleaned and ready to be run through the STV election. Change the file name below to whatever name you saved your pickle file to.

In [36]:
from votekit.pref_profile import PreferenceProfile

profile = PreferenceProfile.from_pickle("thingie.pkl")

VoteKit makes it very easy to run an election. Just import the desired election type, and choose the number of seats up for election.

In [37]:
from votekit.elections import STV

# 3 seat election
election = STV(profile, m=3)

Do we have the correct candidates? Do we have the same vote totals? Do we get the same STV winner set? The Election object, called `election` here, has many built-in methods that allow us to check these stats.

In district 1, Avalos, Dunphy, and Smith were elected. The winners, the first-place vote distribution, and many other stats that we can double-check are given [here](https://www.portland.gov/sites/default/files/2024/Portland-District-1-Certified-Abstract-Nov-2024.pdf).

In [38]:
print("Winners in order of election")
i=0
for cand_set in election.get_elected():
    for cand in cand_set:
        i+=1
        print(i, cand)

Winners in order of election
1 Candace Avalos
2 Loretta Smith
3 Jamie Dunphy


In [39]:
# threshold
print(f"Election Threshold: {election.threshold:,}")

Election Threshold: 10,718
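
This threshold is consistent with the Droop quota, floor(N / (m + 1)) + 1, applied to the cleaned ballot total. The arithmetic below is a sketch under the assumption that the `STV` call above uses the Droop quota (the notebook does not pass a quota argument); `total_weight` is the cleaned total printed earlier.

```python
total_weight = 42871  # cleaned ballot total printed earlier in this notebook
seats = 3             # m, the number of seats in the district

# Droop quota: floor(N / (m + 1)) + 1
droop_threshold = total_weight // (seats + 1) + 1

print(f"{droop_threshold:,}")
# 10,718
```

With three seats the quota is just over a quarter of the ballots, which matches the "roughly a quarter" description in the introduction.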


In [40]:
from votekit.utils import first_place_votes

fpv_dict = first_place_votes(profile)
cands_sorted_by_fpv = sorted(
    fpv_dict.items(),      # (name, fpv) tuples
    key=lambda x: x[1],    # sort by the second element of each tuple, the fpv
    reverse=True,          # decreasing order
)

print("Candidates in decreasing order of first-place votes.\n")
for cand, fpv in cands_sorted_by_fpv:
    print(cand, fpv)

Candidates in decreasing order of first-place votes.

Candace Avalos 8297
Loretta Smith 5586
Jamie Dunphy 5064
Noah Ernst 4052
Terrence Hayes 3975
Steph Routh 3894
Timur Ender 3550
Doug Clove 1698
Peggy Sue Owens 1266
David Linn 1111
Joe Allen 978
Michael (Mike) Sands 952
Deian Salazar 720
Cayle Tern 711
Thomas Shervey 385
Joe Furi 355
Uncertified Write In 277


Take a moment to verify these against the [official record](https://www.portland.gov/sites/default/files/2024/Portland-District-1-Certified-Abstract-Nov-2024.pdf).

## Comparing to other systems

VoteKit makes it very easy to try running the same profile through different election methods. This allows us to answer questions like "If Portland had used a Plurality election, who would have won?"

In [41]:
from votekit.elections.election_types.ranking import CondoBorda, Plurality, Borda


alt_elections = {"Condorcet": CondoBorda(profile, m=3),
                 "Borda": Borda(profile, m=3),
                 "Plurality": Plurality(profile, m=3),
                 }

for e_name, e in alt_elections.items():
    print(e_name)
    e_winners = [c for s in e.get_elected() for c in s]
    print("In order of election, the winners are")

    for i, winner in enumerate(e_winners):
        print(f"{i}) {winner}")
    print()

Condorcet
In order of election, the winners are
0) Candace Avalos
1) Steph Routh
2) Jamie Dunphy

Borda
In order of election, the winners are
0) Candace Avalos
1) Steph Routh
2) Loretta Smith

Plurality
In order of election, the winners are
0) Candace Avalos
1) Loretta Smith
2) Jamie Dunphy
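
The Plurality result can be cross-checked by hand against the first-place totals printed earlier. The sketch below assumes that a three-seat Plurality election simply elects the three candidates with the most first-place votes; the dictionary `first_place` is copied from the output above.

```python
# First-place vote totals copied from the output earlier in this notebook.
first_place = {
    "Candace Avalos": 8297, "Loretta Smith": 5586, "Jamie Dunphy": 5064,
    "Noah Ernst": 4052, "Terrence Hayes": 3975, "Steph Routh": 3894,
    "Timur Ender": 3550, "Doug Clove": 1698, "Peggy Sue Owens": 1266,
    "David Linn": 1111, "Joe Allen": 978, "Michael (Mike) Sands": 952,
    "Deian Salazar": 720, "Cayle Tern": 711, "Thomas Shervey": 385,
    "Joe Furi": 355, "Uncertified Write In": 277,
}

# Sanity check: the first-place votes add up to the cleaned ballot total.
print(sum(first_place.values()))
# 42871

# Top three candidates by first-place votes.
top_three = sorted(first_place, key=first_place.get, reverse=True)[:3]
print(top_three)
# ['Candace Avalos', 'Loretta Smith', 'Jamie Dunphy']
```

The top three match the Plurality winners listed above, and here they coincide with the STV winner set as well.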



Go to the [list of ranking elections that VoteKit supports](https://votekit.readthedocs.io/en/latest/social_choice_docs/scr/#ranking-based), and try importing one and applying it to the Portland profile.

In [None]:
from votekit.elections.election_types.ranking import ??????

election = ??????(profile, m=)

Try changing the number of seats `m`, either on the election type you just imported, or on some of the elections we previously used. Who wins then?

In [None]:
# your code here