## Setting the working directory

If you are running this notebook on Google Colab, you can use the code below to mount your Google Drive, so that the project folder stored there can be used as the working directory.


```
from google.colab import drive
drive.mount('/content/drive')
```
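Mounting the drive does not by itself change the current directory, so relative paths such as `pickles/...` only resolve once the notebook has moved into the project folder. A minimal sketch of that step follows; the helper name `set_working_directory` and the folder placeholder in the usage comment are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def set_working_directory(path):
    """Change the current working directory to `path` and return it resolved."""
    target = Path(path).expanduser().resolve()
    if not target.is_dir():
        raise FileNotFoundError(f"Working directory not found: {target}")
    os.chdir(target)
    return target


# Hypothetical usage on Colab, after drive.mount('/content/drive'):
# set_working_directory("/content/drive/MyDrive/<your project folder>")
```

Raising an error when the folder is missing makes a wrong path visible immediately, rather than later when a pickle fails to load.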

The file structure for this and the other notebooks should be the following:
```
├── Main working directory
│   ├── pickles (folder)
│   │   ├── elections_cleaned.pkl (generated by cleaning pipeline)
│   │   ├── cleaned_press_directories.pkl (generated by cleaning pipeline)
│   ├── input (folder)
│   ├── output (folder)
│   ├── index.html
│   ├── main.js
│   ├── style.css
```
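Before running the cells below, it can help to confirm that this layout is in place. The following is a small sketch, assuming the folder and file names from the tree above; the helper name `check_structure` is hypothetical. It creates any missing folders and reports which pickles still need to be generated by the cleaning notebooks.

```python
from pathlib import Path

# Relative paths taken from the tree above.
EXPECTED_FOLDERS = ["pickles", "input", "output"]
EXPECTED_PICKLES = [
    "pickles/elections_cleaned.pkl",
    "pickles/cleaned_press_directories.pkl",
]


def check_structure(root="."):
    """Create missing folders and return the list of missing pickle files."""
    root = Path(root)
    for folder in EXPECTED_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    return [p for p in EXPECTED_PICKLES if not (root / p).is_file()]


missing = check_structure()
if missing:
    print("Run the cleaning notebooks first; missing:", missing)
```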



## Importing press and elections datasets
These datasets are created and [stored as pickles](https://docs.python.org/3/library/pickle.html) by `elections_cleaning.ipynb` and `press_cleaning.ipynb`, so run those notebooks first. See the README for more information.

In [None]:
import pickle


with open("pickles/cleaned_press_directories.pkl", "rb") as f:
    press = pickle.load(f)

with open("pickles/elections_cleaned.pkl", "rb") as f:
    elections = pickle.load(f)

## Press dataset JSON

The code below generates the JSON for the press dataset. Its final structure is:

```
{
  "1846": {
    "aberdeenshire": {
      "press_data": {
        "liberal": 40.0,
        "neutral": 40.0,
        "conservative": 20.0
      },
      "majority": "multiple majority"
    },
    "angus": {
      "press_data": {
        "liberal": 66.67,
        "conservative": 33.33
      },
      "majority": "liberal"
    },
...
```
Each year maps to its counties; for each county, `press_data` holds the percentage of newspaper titles per political leaning, while `majority` indicates the prevailing press leaning for that county.
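The majority rule can be illustrated on a single county. This is a minimal sketch in plain Python using the two counties from the example above; the helper name `press_majority` is illustrative, and the set of labels treated as unspecified is an assumption that mirrors the cell that follows.

```python
# Labels assumed to carry no clear political leaning.
UNSPECIFIED = {"undefined", "no-politics", "independent", "neutral"}


def press_majority(press_data):
    """Return the majority label for one county's percentage breakdown."""
    top = max(press_data.values())
    leaders = sorted(k for k, v in press_data.items() if v == top)
    if len(leaders) == 1:
        return leaders[0]
    # A tie among only unspecified labels carries no leaning at all.
    if all(k in UNSPECIFIED for k in leaders):
        return "undefined"
    return "multiple majority"


aberdeenshire = {"liberal": 40.0, "neutral": 40.0, "conservative": 20.0}
angus = {"liberal": 66.67, "conservative": 33.33}
print(press_majority(aberdeenshire))  # multiple majority
print(press_majority(angus))          # liberal
```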


In [None]:
import pandas as pd
import json

press = press[press["S-POL"].notna()]




def calculate_relative_frequency(data_list):
    # Helper that is not called below. Despite its name, it returns the raw
    # count of each unique item rather than a share of the total.
    frequency_dict = {}

    for item in set(data_list):
        frequency_dict[item] = data_list.count(item)

    return frequency_dict

def calculate_frequency(x):
    frequency_dict = {}
    total_count = len(x)
    threshold = 5  # percentage threshold

    for pol in x["S-POL"].unique():

        count = x[x["S-POL"] == pol]["S-TITLE"].nunique()
        frequency = round(count / total_count * 100, 2)
        if frequency < threshold:
            # Group categories under 5% into "other", accumulating their shares
            frequency_dict["other"] = round(frequency_dict.get("other", 0) + frequency, 2)
        else:
            frequency_dict[pol] = frequency

    # Sort the dictionary based on values in descending order
    frequency_dict = dict(sorted(frequency_dict.items(), key=lambda item: item[1], reverse=True))
    max_value = max(frequency_dict.values())
    max_keys = sorted([key for key, value in frequency_dict.items() if value == max_value])
    unspecified = ["undefined", "no-politics", "independent", "neutral"]

    if len(max_keys) == 1:
        majority = max_keys[0]
    else:
        if all(key in unspecified for key in max_keys):
            majority = "undefined"
        else:
            majority = "multiple majority"
    county_dict = {
        "press_data": frequency_dict,
        "majority": majority
        }
    return county_dict


frequency = (
    press.groupby(["year", "map_county"])
    .apply(calculate_frequency)
    .reset_index(name="results")
)



final_dict = frequency.groupby('year')[['map_county','results']].apply(lambda x: x.set_index('map_county')["results"].to_dict()).to_dict()

# NOTE: `most_common` is not defined in this cell; it is assumed to hold the
# majority labels to keep unchanged and must be defined before this loop runs.
for year in final_dict:
    for county, county_data in final_dict[year].items():
        if county_data["majority"] not in most_common:
            if "&" in county_data["majority"]:
                county_data["majority"] = "multiple majority"
            else:
                county_data["majority"] = "other"



# Save the dictionary as a JSON file
with open("output/press_data.json", "w") as json_file:
    json.dump(final_dict, json_file, indent = 2)



## Elections dataset JSON

The code below generates the JSON for the elections dataset. Its final structure is:

```
{
  "1847": {
    "aberdeenshire": {
      "data": {
        "Liberal Party (Original)": 1,
        "Conservative": 1
      },
      "majority": "multiple majority"
    },
    "berkshire": {
      "data": {
        "Conservative": 6,
        "Liberal Party (Original)": 6
      },
      "majority": "multiple majority"
    },
    "hampshire": {
      "data": {
        "Conservative": 11,
        "Liberal Party (Original)": 10
      },
      "majority": "Conservative"
    },
...
```
Each year maps to its counties; for each county, `data` holds the number of MPs elected per party, while `majority` indicates the party with the most MPs for that county.
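As an illustration of how this nested structure can be consumed, the sketch below sums the MPs per party across counties for one year. It uses two of the counties from the example above; the helper name `seats_by_party` is hypothetical. Year keys are looked up as strings because `json.dump` writes integer dictionary keys as strings.

```python
from collections import Counter

# Sample copied from the structure above (two counties only).
sample = {
    "1847": {
        "aberdeenshire": {
            "data": {"Liberal Party (Original)": 1, "Conservative": 1},
            "majority": "multiple majority",
        },
        "hampshire": {
            "data": {"Conservative": 11, "Liberal Party (Original)": 10},
            "majority": "Conservative",
        },
    }
}


def seats_by_party(elections_json, year):
    """Sum the MPs elected per party across all counties for one year."""
    totals = Counter()
    for county in elections_json[str(year)].values():
        totals.update(county["data"])
    return dict(totals)


print(seats_by_party(sample, 1847))
# {'Liberal Party (Original)': 11, 'Conservative': 12}
```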

This process is slightly longer than the one for the press dataset, since it has four steps:
1. Use candidates' vote shares to calculate the winner
2. Group winners for each constituency in the larger counties they belong to
3. Sort the grouped results in descending order
4. Iterate through all dictionaries to calculate the majority party

Some steps could be combined; I kept them separate for clarity.
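Steps 1 to 3 can be sketched on toy data in plain Python. This is a simplified illustration, not the notebook's implementation: the constituency names, county, parties, vote shares and seat counts are all made up, the helper name `county_results` is hypothetical, and the sketch assumes that the winners in each constituency are the top candidates by vote share, one per available seat.

```python
from collections import Counter, defaultdict

# Made-up candidate rows: (constituency, county, party, vote_share)
candidates = [
    ("A North", "exampleshire", "Liberal", 55.0),
    ("A North", "exampleshire", "Conservative", 45.0),
    ("A South", "exampleshire", "Conservative", 60.0),
    ("A South", "exampleshire", "Liberal", 40.0),
]
seats = {"A North": 1, "A South": 1}  # made-up seat counts


def county_results(rows, seats_per_constituency):
    """Pick winners per constituency, group them by county, sort by seats."""
    by_constituency = defaultdict(list)
    for constituency, county, party, share in rows:
        by_constituency[constituency].append((share, party, county))

    counties = defaultdict(Counter)
    for constituency, entries in by_constituency.items():
        # Step 1: highest vote shares first, keep one winner per seat.
        entries.sort(reverse=True)
        for _, party, county in entries[: seats_per_constituency[constituency]]:
            # Step 2: add the winner's party to the county total.
            counties[county][party] += 1

    # Step 3: order each county's parties by seats, descending.
    return {c: dict(counts.most_common()) for c, counts in counties.items()}


print(county_results(candidates, seats))
# {'exampleshire': {'Liberal': 1, 'Conservative': 1}}
```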

In [None]:
# 1910 had two elections in the course of the year because the first one brought
# on a hung parliament. Each election had its own ID (636, 637) and I kept only
# the latter
elections = elections[elections["id"] != 636]


def process_unique(x):
    unique_values = x.unique()
    if len(unique_values) > 1:
        print("Multiple unique values found at index: cst={}, yr={}, index={}".format(x.name[0], x.name[1], x.index))
    return unique_values.tolist() if len(unique_values) > 1 else unique_values[0]


# Dictionary containing the matches between constituency names and counties
# I found while cleaning the elections dataset
county_mapping = dict(zip(elections['cst_n'], elections['map_county']))

sorted_mapping = {key: county_mapping[key] for key in sorted(county_mapping)}



def map_county(cst_n):
    # Return the county for a constituency name, or None if it is unknown
    try:
        return county_mapping[cst_n]
    except KeyError:
        print("unfound", cst_n)
        return None

# Dictionary that stores, for each year, the constituency name corresponding
# to each constituency id
counties_equivalence = {}
for _, row in elections.iterrows():
    yr = row['yr']
    cst = row['cst']
    cst_n = row['cst_n']

    counties_equivalence.setdefault(yr, {})
    counties_equivalence[yr][cst] = cst_n


# Dictionary that stores, for each year, the party name corresponding to each
# party id
parties_equivalence = {}
for _, row in elections.iterrows():
    yr = row['yr']
    pty = row['pty']
    pty_n = row['pty_n']

    parties_equivalence.setdefault(yr, {})
    parties_equivalence[yr][pty] = pty_n


# This part calculates the elected MP for each constituency and their party
# It does so by:
# 1. Taking all rows related to a constituency
# 2. Checking the number of seats up for election (variable mag)
# 3. Checking if the election was uncontested (variable vot1 == -992)

mps = {}
results = {}
results_copy = {}

for year in elections["yr"].unique():
    first = elections[elections["yr"] == year]

    results_election = {}

    for constituency in first["cst_n"].unique():
        results_election[constituency] = {}
        constituency_df = first[first["cst_n"] == constituency]
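        seats = constituency_df["mag"].unique()
        uncontested = constituency_df["vot1"].unique()
        # Both columns should hold a single value per constituency; skip the
        # constituency if they do not, since the checks below need integers
        try:
            seats = int(seats)
        except (TypeError, ValueError):
            print("ERROR, seats is not an integer", seats, constituency, year)
            continue

        try:
            uncontested = int(uncontested)
        except (TypeError, ValueError):
            print("ERROR, uncontested is not an integer", uncontested, constituency, year)
            continue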
        parties = constituency_df["pty_n"].unique()
        parties_running = len(parties)

        if seats == 1:
            if uncontested == -992:
                # This means that the election is uncontested with a single seat
                if parties_running > 1:
                    print("ERROR, there are multiple parties running for one seat")

        if uncontested == -992:
            # This means that the election is uncontested
            if parties_running > seats:
                # This would be an error, since it cannot be uncontested if there
                # are more parties than seats, in principle
                print("ERROR, there are more parties than seats in uncontested election", constituency, year)

            else:
                # Mostly used for 3- or 4-seat constituencies where there might be fewer
                # parties than seats and one party is getting multiple seats
                for party in constituency_df["pty_n"]:

                    if party in results_election[constituency].keys():
                        results_election[constituency][party] += int(seats/len(constituency_df["pty_n"]))

                    else:
                        results_election[constituency][party] = int(seats/len(constituency_df["pty_n"]))

        else:
            if parties_running == 1:
                results_election[constituency][parties[0]] = int(seats)
            elif parties_running > 1:
                sorted_constituency_df = constituency_df.sort_values("cvs1", ascending=False)
                # Keep the top `seats` candidates by vote share: these are the winners
                df = sorted_constituency_df.iloc[0:seats]

                # Group the dataframe by column pty
                grouped_df = df.groupby("pty_n")

                # Get the frequency counts of each party
                party_counts = grouped_df["pty_n"].size()

                # Create a dictionary with the party names and their frequency counts
                party_dict = dict(zip(party_counts.index, party_counts))
                results_election[constituency] = party_dict
            else:
                print("ERROR, there are 0 or fewer parties running")
        if seats < 0:
            print(seats, constituency)
        else:
            if year in mps.keys():
                mps[year] += seats
            else:
                mps[year] = seats

    results[int(year)] = results_election



# After having calculated the results for each constituency, they are grouped
# following the county matchings found in the cleaning pipeline

for year_key, results_election in results.items():
    constituency_dictionary = {}

    for ungrouped_constituency in results_election:

        geoloc_county = map_county(ungrouped_constituency)
        if isinstance(geoloc_county, str):
            geoloc_county = geoloc_county.lower()
        if geoloc_county in constituency_dictionary.keys():
            for key, value in results_election[ungrouped_constituency].items():

                if key in constituency_dictionary[geoloc_county].keys():

                    constituency_dictionary[geoloc_county][key] += value
                else:

                    constituency_dictionary[geoloc_county][key] = value
        else:
            constituency_dictionary[geoloc_county] = results_election[ungrouped_constituency]


    results[year_key] = constituency_dictionary

# This last loop sorts the dictionary based on values in descending order. It is
# possible to do this in JavaScript as well

for year, counties in results.items():
    for county, county_results in results[year].items():
        results[year][county] = dict(sorted(results[year][county].items(), key=lambda item: item[1], reverse=True))





### Finding majorities
This is step 4: finding the majority party for each county.

In [None]:
# This function takes a dictionary and returns the keys that share the highest value
def find_majority(dictionary):
    majority_seats = max(dictionary.values())
    return [key for key, value in dictionary.items() if value == majority_seats]

for year, value in results.items():
  for county, parties in value.items():
    majority_parties = find_majority(parties)
    if len(majority_parties) == 1:
      value[county] = {
          "data" : parties,
          "majority" : majority_parties[0]
          }

    else:
      value[county] = {
          "data" : parties,
          "majority" : "multiple majority"
      }

# Finally, I can export the JSON file
with open("output/elections.json", "w") as json_file:
    json.dump(results, json_file, indent = 2)