# Data sources, data acquisition, data prep

This notebook will show you a variety of data sources for profiles of ballots, including Scottish STV elections, Minnesota IRV elections, and those collected by FairVote. This notebook will also show you what kind of cleaning is often required, and how to do that cleaning with VoteKit.

## Scottish Profiles

Scottish elections give us a great source for real-world ranked data, because STV is used for local government elections. Thanks to David McCune of William Jewell College, we have a fantastic [repository](https://github.com/mggg/scot-elex) of shiny, clean ranking data from over 1000 elections, each featuring 3 to 14 candidates who run with a party label.

Go to the [repository](https://github.com/mggg/scot-elex), choose a locality, and download the csv file to your working directory (the same folder as your code). You will need to edit the code below to reflect your file name.

In [None]:
from votekit.cvr_loaders import load_scottish

# the load_scottish function returns a tuple of information:
# the first element is the profile itself, the second is the number of seats in the election,
# the third is a list of candidates, the fourth a dictionary mapping candidates to parties,
# and the fifth the ward name
scottish_profile, num_seats, cand_list, cand_to_party, ward = load_scottish("../../../data/west_dunbartonshire_2017_ward2.csv") 

Let's quickly look at each of the returned variables.

In [None]:
print(f"This election took place in {ward}.")
print(f"The number of seats up for election was {num_seats}.")
print(f"The number of candidates was {len(cand_list)}.")

In [None]:
from votekit.pref_profile import profile_df_head
print(scottish_profile)
print()
print("The top 10 ballots by weight are")
print(profile_df_head(scottish_profile, 10).to_string())


In Scottish elections, voters can rank up to the number of candidates. The most common vote in Scottish elections tends to be a ballot of length `num_seats`, followed by bullet votes (votes for one candidate).
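To make the idea of ballot length concrete, here is a small standalone sketch in plain Python. It does not use VoteKit: the `toy_ballots` data and the `length_distribution` helper are made up for illustration and are not drawn from any real election.

```python
from collections import Counter

# Hypothetical toy data: each entry is (ranking, weight), where weight is the
# number of voters who cast that exact ranking.
toy_ballots = [
    (("A", "B", "C"), 40),
    (("B", "A", "C"), 25),
    (("A",), 20),   # a bullet vote
    (("C", "B"), 10),
    (("B",), 5),    # another bullet vote
]


def length_distribution(ballots):
    """Return a Counter mapping ballot length to the total weight at that length."""
    counts = Counter()
    for ranking, weight in ballots:
        counts[len(ranking)] += weight
    return counts


dist = length_distribution(toy_ballots)
for length, weight in sorted(dist.items()):
    print(f"length {length}: {weight} ballots")
```

In this toy profile the full-length ballots are the most common, followed by bullet votes, which mirrors the pattern described above.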

One of the utilities of this repository of elections is that the candidates are labeled with what party they ran under.

In [None]:
for cand, party in cand_to_party.items():
    print(f"{cand} ran under the following party: {party}\n")

Scottish elections use the STV mechanism, so let's quickly see who the winner set is.

In [None]:
from votekit.elections import STV

e = STV(scottish_profile, m=num_seats)

print(e.get_elected())

We read this tuple as a ranking: the first entry of the tuple is the candidate elected first, etc. Your tuple should look something like `(frozenset({'Ian Dickson'}), frozenset({'Jim Bollan'}), frozenset({'John Kelly Millar'}), frozenset({'Caroline Mcallister'}))`.
This means Ian Dickson was elected first, then Jim Bollan, then John Kelly Millar, then Caroline Mcallister.

## Minnesota 2013


Another possible data source is real-world elections that return their cast vote records (CVRs) as csv files. To be readable by VoteKit, the csv file must have a row for each voter, and must have one column per ranking position.
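As a minimal sketch of that layout, the snippet below builds a tiny csv in memory and reads it back with Python's standard `csv` module. The column names and candidate labels are hypothetical; a real CVR will have its own headers and usually extra columns.

```python
import csv
import io

# Hypothetical CVR contents: one row per voter, one column per ranking position.
# An empty cell means the voter left that position blank.
toy_csv = """rank1,rank2,rank3
A,B,C
B,A,
A,,
"""

reader = csv.reader(io.StringIO(toy_csv))
header = next(reader)                       # one column name per ranking position
rankings = [tuple(row) for row in reader]   # one tuple per voter

print(header)
for ranking in rankings:
    print(ranking)
```

Each tuple is one voter's ballot, read from first place to last.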

The Minnesota 2013 Mayoral race, which used IRV, did just that. Let's load the csv file into VoteKit. You can find the file [here](https://github.com/mggg/Training_Materials_25/blob/main/data/mn_2013_cast_vote_record.csv). Download it and put it into your working directory.

Voters were allowed to rank three candidates.



In [None]:
from votekit.cvr_loaders import load_csv

mn_profile = load_csv("../../../data/mn_2013_cast_vote_record.csv", rank_cols=[0,1,2]) # the first 3 columns of the csv hold the ranking information     
                                                                                     # in order from 1st place to 3rd place

Let's look at the candidates for the race.

In [None]:
for candidate in mn_profile.candidates:
    print(candidate)

Whoa, that’s a little funky! There are candidates called ‘undervote’, ‘overvote’, and ‘UWI’. This cast vote record was already cleaned by the City of Minneapolis, and they chose this way of parsing the ballots: ‘undervote’ indicates that the voter left a position unfilled, such as by having no candidate listed in second place. The ‘overvote’ notation arises when a voter puts two candidates in one position, for example by putting Hodges and Samuels both in first place. Unfortunately, this way of storing the profile means we have lost any knowledge of the voter's intent (which was probably to indicate equal preference). ‘UWI’ stands for unregistered write-in.

This reminds us that it is really important to think carefully about how we want to handle cleaning ballots, as some storage methods are efficient but lossy. For now, let’s assume that we want to further condense the ballots, discarding ‘undervote’, ‘overvote’, and ‘UWI’ as candidates. 

This happens in two stages. First, we remove the "candidates" from the rankings in the ballots. So a ballot (overvote, Betsy Hodges, UWI) becomes the ballot (, Betsy Hodges, ). Then we *condense* the ballot so (, Betsy Hodges, ) becomes (Betsy Hodges).
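The two stages can be sketched in plain Python on a single ballot. This is only an illustration of the idea, not VoteKit's implementation: the helpers `remove_names` and `condense` are hypothetical, and `None` stands in for an emptied position.

```python
NON_CANDIDATES = {"overvote", "undervote", "UWI"}


def remove_names(ranking, names):
    """Stage 1: blank out unwanted names but keep their positions (None marks a gap)."""
    return tuple(None if cand in names else cand for cand in ranking)


def condense(ranking):
    """Stage 2: drop the gaps so the remaining candidates move up."""
    return tuple(cand for cand in ranking if cand is not None)


ballot = ("overvote", "Betsy Hodges", "UWI")
removed = remove_names(ballot, NON_CANDIDATES)
condensed = condense(removed)

print(removed)
print(condensed)
```

Keeping the stages separate makes it possible to inspect the intermediate ballots before the positions are collapsed.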

In [None]:
from votekit.cleaning import remove_cand, condense_profile

remove_cand_mn_profile = remove_cand(["overvote", "undervote", "UWI"], mn_profile)

Let's see that the three "candidates" have been removed and that the ballots have *not* been condensed yet.

In [None]:
print("The following candidates appear in the uncleaned profile but have been removed.")
print(set(mn_profile.candidates)-set(remove_cand_mn_profile.candidates))

In [None]:
for ballot in remove_cand_mn_profile.ballots:
    if frozenset() in ballot.ranking:
        print("Here is a ballot that had a 'candidate' removed but is not yet condensed.")
        print(ballot)
        break

To complete the cleaning, we apply `condense_profile`, which moves up any lower ranked candidates as a result of the removal of the three non-candidates.

In [None]:
cleaned_mn_profile = condense_profile(remove_cand_mn_profile)

In [None]:
# True if any ballot still contains an empty ranking position
not_condensed = any(
    frozenset() in ballot.ranking for ballot in cleaned_mn_profile.ballots
)

if not_condensed:
    print("Something went wrong, a ballot is not condensed.")
else:
    print("All ballots are condensed")
    

Now all of the ballots are properly formatted to run an IRV election.

In [None]:
from votekit.elections import IRV

e = IRV(cleaned_mn_profile)  # run the election on the cleaned profile, not the raw one

print(e.get_elected())

## Cleaning a csv before VoteKit

Sometimes, the format of a CVR released by a locality does not match what VoteKit requires. Recall that to read a csv file, VoteKit needs one voter per row and one column per ranking position.

In the 2024 Portland, OR City Council election, the city released the cast vote record in a format that mirrored the scantron-style ballot. Each voter was given a row in a table, and each candidate was given 6 columns, one for each ranking position. A vote for a candidate in position i was recorded as a 1 in that candidate's "Ranking i" column, and a 0 otherwise. This allows for the possibility of overvotes (multiple candidates have a 1 in their "Ranking i" column) and skips (no candidate has a 1 in their "Ranking i" column).

In order to make this format readable by VoteKit, we need to transform it so that there are only 6 columns total. Each column represents one position of a ranking, and the entry of that column is the candidate ranked in that position.
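Here is a toy version of that transformation in plain Python, shortened to three positions and three candidates. The data layout (a dict keyed by `(candidate, position)`) and the helper `row_to_ranking` are assumptions chosen for brevity; the real file uses long column names and six positions.

```python
NUM_POSITIONS = 3            # the real ballot has 6; 3 keeps the toy example short
CANDIDATES = ["A", "B", "C"]  # hypothetical candidate labels


def row_to_ranking(row):
    """Convert one voter's 0/1 marks into one entry per ranking position."""
    ranking = []
    for position in range(1, NUM_POSITIONS + 1):
        marked = [cand for cand in CANDIDATES if row.get((cand, position), 0) == 1]
        if len(marked) == 0:
            ranking.append("")           # skipped position
        elif len(marked) > 1:
            ranking.append("overvote")   # more than one candidate marked
        else:
            ranking.append(marked[0])
    return ranking


# Hypothetical voter: B first, A and C both marked second, nothing marked third.
toy_row = {("B", 1): 1, ("A", 2): 1, ("C", 2): 1}
toy_ranking = row_to_ranking(toy_row)
print(toy_ranking)
```

The output has exactly one entry per ranking position, which is the shape VoteKit expects.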

The city also released the data by district, but included every voter from the entire city in each data set. So we will have to scrub the voters from the other 3 districts.

First, we will read in the csv of the cast vote record, scrub the non-district 1 voters, and create new columns that match the format needed by VoteKit. The csv is too large to be stored in GitHub, so [here is a link.](https://multco.us/info/turnout-and-statistics-november-2024-general-election) You will want the "Councilor District 1 Cast Vote Record Data". Be sure to save it in your working directory.

After we reformat the data, we will use VoteKit to perform the rest of the cleaning.

In [None]:
import pandas as pd

D1_df = pd.read_csv("../../../data/Portland_D1_raw_from_city.csv") #insert the file name that you used when you downloaded the csv
D1_df.head()

Wow, 130 columns is a lot, too many for the dataframe to display. Let's look at them just to get familiar with the data set.

In [None]:
for column in D1_df.columns:
    print(column)

Here we can see that each candidate gets six columns, one for each ranking position. Remember, in order to make this format readable by VoteKit, we need to transform it so that there are only 6 columns total. Each column represents one position of a ranking, and the entry of that column is the candidate ranked in that position.

For now we want to just keep track of the columns that have ranking data.

In [None]:
# stores all columns that have ranking information
rank_columns = {i:[col for col in D1_df.columns if f'{i}:Number' in col] for i in range(1,7)}
all_rank_cols = [col for col_list in rank_columns.values() for col in col_list]

The code below scrubs any voter who did not cast at least one vote, which in turn removes any voter not from district 1.

In [None]:
# keep only voters with at least one mark in a ranking column, then reset the index of the df
D1_voters_df = D1_df[D1_df[all_rank_cols].sum(axis=1) > 0].reset_index(drop=True)

We now add the new ranking columns that match the VoteKit format.

In the process, we will lose some information about overvotes, when voters put more than one candidate in a ranking.
 
(Warning about runtime:  in a local installation, this cleaning block takes 30 seconds, but in Colab it can take 4 minutes or more!)



In [None]:
from tqdm.notebook import tqdm

ranking_data = {i:[-1 for _ in range(len(D1_voters_df))] for i in range(1,7)}

for voter_index, row in tqdm(D1_voters_df.iterrows()):
    for rank_position in range(1,7):
        num_votes_cast = row[rank_columns[rank_position]].sum()

        if num_votes_cast == 0:
            cast_vote = ""

        elif num_votes_cast > 1:
            cast_vote = "overvote"

            # here we lose knowledge of who was in the overvote. That's how Portland runs its election
            # system, but it could be interesting to study who is in the overvote!

        else:
            # find candidate name from column
            pd_series = row[rank_columns[rank_position]]
            cast_vote_column_name = pd_series.loc[pd_series == 1].index.tolist()[0]
            cast_vote = cast_vote_column_name.split(":")[-2]

        ranking_data[rank_position][voter_index] = cast_vote

# add the new columns
for rank_position in range(1,7):
    D1_voters_df[f"Rank {rank_position}"] = ranking_data[rank_position]

In [None]:
ranking_df = D1_voters_df[[f"Rank {rank_position}" for rank_position in range(1,7)]]
ranking_df.head()

Now it is in the correct format for VoteKit to read, so we can save it to a csv. Choose a file name that makes sense to you.

In [None]:
ranking_df.to_csv("your_file_name_here.csv")

Now that the csv is in the correct format for VoteKit, we can complete our cleaning using VoteKit's built in cleaning tools.

### Try it yourself:

Load the raw Portland profile from the csv you just saved using the `load_csv` function. Note, the ranking columns here need to be determined. Remember that Python starts indexing from 0. 

In [None]:
rank_cols = [] # type the numbers of the columns you need here, like [5,7,12,14]
portland_profile = load_csv("your_file_name_here.csv", rank_cols=rank_cols)

### Try it yourself:

Print out the list of candidates.

In [None]:
# Your code here

Under the rules of Portland's election, any skipped positions and overvotes are ignored by the STV algorithm, and any candidates ranked below such a position are moved up. The same thing happens to three of the write-in categories, but oddly enough, not to the "Uncertified Write-in" category.

### Try it yourself:

Using the `remove_cand` function, remove 'overvote', 'Write-in-120', 'Write-in-121', and 'Write-in-122' from the profile.

In [None]:
portland_profile_cands_removed = ####

We also have to handle one more item of cleaning. It is entirely possible that a voter listed the same candidate more than once on their ballot, which is not allowed. Portland chose to keep the first occurrence, and ignore any later occurrences, condensing any positions left empty as a result.
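The "keep the first occurrence" rule can be sketched in plain Python on one ballot. This is an illustration under simplifying assumptions, not VoteKit's implementation: the helpers `keep_first_occurrence` and `condense` are hypothetical, and `None` marks an emptied position.

```python
def keep_first_occurrence(ranking):
    """Blank out every repeat of a candidate, keeping only the first appearance."""
    seen = set()
    cleaned = []
    for cand in ranking:
        if cand is not None and cand in seen:
            cleaned.append(None)       # later occurrence: leave a gap
        else:
            cleaned.append(cand)
            if cand is not None:
                seen.add(cand)
    return tuple(cleaned)


def condense(ranking):
    """Drop the gaps so lower-ranked candidates move up."""
    return tuple(cand for cand in ranking if cand is not None)


ballot = ("A", "B", "A", "C")
deduplicated = keep_first_occurrence(ballot)
final_ranking = condense(deduplicated)

print(deduplicated)
print(final_ranking)
```

As with removing non-candidates, the repeat is first blanked in place and the gap is closed in a separate condensing step.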

In [None]:
from votekit.cleaning import remove_repeated_candidates

portland_profile_pre_condensed = remove_repeated_candidates(portland_profile_cands_removed)

### Try it yourself:

Apply the `condense_profile` function to complete the cleaning.

In [None]:
cleaned_portland_profile = ####

Finally, the profile is cleaned and we can save it for analysis. We save it as a thing called a "pickle file," which is a way of storing Python variables. Choose a file name that makes sense to you!

In [None]:
cleaned_portland_profile.to_pickle("YOUR FILE NAME HERE.pkl")

We can now load the cleaned profile, and run an STV election for three seats to confirm that we cleaned it appropriately.

In [None]:
from votekit.pref_profile import PreferenceProfile

cleaned_portland_profile = PreferenceProfile.from_pickle("YOUR FILE NAME HERE.pkl") # change this to whatever you named your file

election = STV(cleaned_portland_profile, m=3)

Do we have the correct candidates? Do we have the same vote totals? Do we get the same STV winner set? The Election object, called `election` here, has lots of built in methods that allow us to check these stats.

In district 1, Avalos, Dunphy, and Smith were elected. The winners, first place vote distribution, and lots of other stats we can double check, are given [here](https://www.portland.gov/sites/default/files/2024/Portland-District-1-Certified-Abstract-Nov-2024.pdf).

In [None]:
print("Winners in order of election")
i=0
for cand_set in election.get_elected():
    for cand in cand_set:
        i+=1
        print(i, cand)

In [None]:
# threshold
print(f"Election Threshold: {election.threshold:,}")
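The threshold depends on which quota rule the election uses. As a rough sketch, assuming the Droop quota (an assumption here; check the `STV` documentation for the quota your version applies), the arithmetic looks like this. The function name `droop_quota` and the vote totals are made up for illustration.

```python
def droop_quota(total_ballots, num_seats):
    """Droop quota: floor(total / (seats + 1)) + 1."""
    return total_ballots // (num_seats + 1) + 1


# Hypothetical totals, not the Portland figures.
example_threshold = droop_quota(1000, 3)
print(f"Threshold for 1,000 ballots and 3 seats: {example_threshold:,}")
```

With three seats, this works out to just over a quarter of the ballots.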

In [None]:
from votekit.utils import first_place_votes

fpv_dict = first_place_votes(cleaned_portland_profile)
cands_sorted_by_fpv = sorted(
    fpv_dict.items(),     # (name, fpv) pairs
    key=lambda x: x[1],   # sort by fpv, the second element of each pair
    reverse=True,         # decreasing order
)

print("Candidates in decreasing order of first-place votes.\n")
for cand, fpv in cands_sorted_by_fpv:
    print(cand, fpv)

Take a moment to verify these against the [official record](https://www.portland.gov/sites/default/files/2024/Portland-District-1-Certified-Abstract-Nov-2024.pdf).

## FairVote Repo

As one final source of data, [FairVote](https://fairvote.org/) maintains a repository of cast vote records from single and multi-winner ranked choice elections [here](https://dataverse.harvard.edu/dataverse/rcv_cvrs).

### Try it yourself:

1) Go to the [repo](https://dataverse.harvard.edu/dataverse/rcv_cvrs), and choose a single winner CVR. Download it as a csv, and place it into your working directory.

2) Load the csv using `load_csv`. Be sure to choose the right `rank_cols`.

3) Check the list of candidates, and if necessary, use `remove_cand` to get rid of non-candidates.

4) Apply `remove_repeated_candidates` to the profile. For those of you who want a challenge, see if you can determine whether any ballots actually have repeated candidates.

5) Apply `condense_profile`.

6) Run an IRV election on the cleaned profile.

## If there is time

# TODO