
---

# Motivation

### What is your dataset?

We selected a comprehensive dataset of European Parliament voting records. This dataset captures how each Member of the European Parliament (MEP) voted on proposed legislation, along with detailed information about the legislation itself, the MEPs, and their party affiliations. The dataset includes two main file types:

* **RCV (Roll Call Votes):** Contains individual vote records for each MEP, including their country, national party, and European Parliamentary Group (EPG).
* **Voted Docs:** Provides metadata on the legislation being voted on, including the vote date, outcome, and associated policy area.

The data is divided across four parliamentary sessions:

* EP6 (2004–2009)
* EP7 (2009–2014)
* EP8 (2014–2019)
* EP9 (2019–2022)

To enable longitudinal analysis, we merged data from all four sessions. This required extensive preprocessing to reconcile differences in schema and formatting across sessions.

### Why did you choose this particular dataset?

We chose this dataset because it offers a rich foundation for analyzing trends in European politics over time. It enables us to explore whether broader political shifts—such as increasing polarization or the rise of right-leaning ideologies—are observable in parliamentary voting behavior.

### What was your goal for the end user's experience?

Our goal is to provide users with an accessible interface to explore aggregated voting patterns and to present clear, data-driven insights into evolving political dynamics within the European Parliament. We aim to make it easy for users to identify trends related to polarization, such as which parties or policy areas are becoming more divisive, and which remain broadly supported. Ultimately, we want to support informed reflection on how shifts in ideology are shaping legislative decision-making at the EU level.

---


---

# Basic Stats

### Preprocessing

To analyze voting trends in the European Parliament across four sessions (EP6–EP9), we consolidated roll-call vote data (`RCV`) with metadata on each legislative item (`Voted Docs`). Due to inconsistencies across sessions in schema, date formats, and naming conventions, several preprocessing steps were necessary:

* **Name Standardization:** MEP names were cleaned to ensure consistency across sessions, accounting for formatting, punctuation, and Unicode differences.
* **Text Normalization:** Policy area fields and party/EPG names were cleaned to remove noise (e.g., punctuation, inconsistent capitalization).
* **Date Parsing:** Dates appeared in both `dd.mm.yyyy` and `yyyy-mm-dd` formats. A custom parser ensured accurate conversion to a uniform `datetime` format.
* **Schema Harmonization:** Since column names and vote identifiers varied by session (e.g., `euro_act_id` in EP6 vs `Vote ID` in later sessions), we created session-aware mappings to unify data.
* **Party Group Mapping:** EP political group (`EPG`) names were mapped to their common abbreviations (e.g., “Group of the European People’s Party...” → `EPP`) to allow for consistent comparison.
* **Missing Values:** Non-informative entries (e.g., empty vote fields or unknown policy areas) were filtered or imputed based on context.

These steps allowed us to generate a unified dataset spanning 2004–2022 with over **4.8 million individual votes**.

### Dataset Statistics

Key observations from the cleaned data:

* 📅 **Time Coverage:** The dataset spans four European Parliament terms:

  * EP6: 2004–2009
  * EP7: 2009–2014
  * EP8: 2014–2019
  * EP9: 2019–2022

* 🧑‍🤝‍🧑 **MEPs Involved:** \~1,300 unique MEPs from all 27 EU member states.

* 📄 **Votes Recorded:**

  * \~26,000 roll-call votes
  * \~4.8 million MEP-level voting entries (rows)

* 🗳️ **Vote Distribution (Sample):**

  | Vote      | Meaning                    | % of Total     |
  | --------- | -------------------------- | -------------- |
  | 1         | For                        | \~48%          |
  | 2         | Against                    | \~27%          |
  | 3         | Abstain                    | \~12%          |
  | 4 / 5 / 6 | Absent / No vote / Excused | \~13% combined |

* 🏛️ **Top Policy Areas:** After cleaning and consolidating, common topics included:

  * Environment
  * Budget and financial affairs
  * Foreign relations
  * Digital and industry regulation
  * Civil rights and rule of law

* 🧭 **EP Group Participation:** Most votes came from major groups like `EPP`, `S&D`, `Renew`, `ID`, `Greens/EFA`, and `The Left`.

---


In [193]:
import pandas as pd

def clean_name(first_name, last_name):
    import unicodedata
    
    if not isinstance(first_name, str):
        first_name = str(first_name) if first_name is not None else ""
    if not isinstance(last_name, str):
        last_name = str(last_name) if last_name is not None else ""
    
    first_name = first_name.lower().strip()
    last_name = last_name.lower().strip()
    
    def normalize_chars(text):
        text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
        return text
    
    first_name = normalize_chars(first_name)
    last_name = normalize_chars(last_name)
    
    for char in ['-', "'", "`", ".", ",", "&", "'"]:  # Added apostrophe variants
        first_name = first_name.replace(char, ' ')
        last_name = last_name.replace(char, ' ')
    
    while '  ' in first_name:
        first_name = first_name.replace('  ', ' ')
    while '  ' in last_name:
        last_name = last_name.replace('  ', ' ')
        
    first_name = ' '.join(word.capitalize() for word in first_name.split())
    last_name = ' '.join(word.capitalize() for word in last_name.split())
    
    full_name = f"{first_name} {last_name}".strip()
    
    return full_name

def clean_text(text):
    if not isinstance(text, str):
        return text
  
    text = text.lower()
    
    for char in ['&', ',', '-']:
        text = text.replace(char, ' ')
    
    text = text.replace(' and ', ' ')
    
    while '  ' in text:
        text = text.replace('  ', ' ')
    
    return text.strip()    

def process_ep_voting_data(rcv_files, voted_docs_files):

    if len(rcv_files) != len(voted_docs_files):
        raise ValueError("The lists of RCV files and Voted docs files must have the same length")
     
    all_data = []
    
    for i, (rcv_file, voted_doc_file) in enumerate(zip(rcv_files, voted_docs_files)):
        print(f"Processing files {i+1}/{len(rcv_files)}: {rcv_file} and {voted_doc_file}")
        
        if "EP6" in rcv_file:
            ep_session = "EP6"
            vote_start_index = 10
            rcv_data = pd.read_excel(rcv_file, header=1)
        elif "EP7" in rcv_file:
            ep_session = "EP7"
            vote_start_index = 9
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        elif "EP8" in rcv_file:
            ep_session = "EP8"
            vote_start_index = 9
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        elif "EP9" in rcv_file:
            ep_session = "EP9"
            vote_start_index = 10
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        else:
            ep_session = "Unknown"
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
            print("UNKNOWN SESSION")

        rcv_data = rcv_data.dropna(how='all')
        
        voted_docs = pd.read_excel(voted_doc_file)


        # Get vote columns headers (index)
        vote_columns = rcv_data.columns[vote_start_index:].tolist()
       
        votes_df = process_votes_ep(rcv_data, voted_docs, vote_columns, ep_session=ep_session)

        print(f"Should be around: {len(rcv_data) * len(voted_docs)}")
        print(f"Got: {len(votes_df)}")      

        # Add EP session information
        votes_df['ep_session'] = ep_session
        
        # Append to the list of results
        all_data.append(votes_df)
    
    # Concatenate all dataframes
    combined_df = pd.concat(all_data, ignore_index=True)
    
    # Perform final cleaning
    return combined_df


def process_votes_ep(rcv_data, voted_docs, vote_columns, ep_session = None):
    """Process voting data for EP7, EP8, EP9 sessions"""

    total_skipped = 0

    if ep_session == 'EP6':
        date = 'date'
        title = 'title'
        policy_area = 'main_policy_name'
        vote_id_key = 'euro_act_id'
        author = 'author_name'

        mep_id_key = 'WebisteEpID'

    else:
        date = 'Date'
        title = 'Title'
        policy_area = 'De'
        vote_id_key = 'Vote ID'
        author = 'Author'

        mep_id_key = 'WebisteEpID'

        if ep_session == 'EP7':
            mep_id_key = 'MEP ID'

        if ep_session == 'EP8':
            policy_area = "De/Policy area"

        elif ep_session == 'EP9':
            policy_area = 'Policy area'
  
    
    # Create a dictionary to map vote IDs to vote information
    vote_info = {}
    for _, row in voted_docs.iterrows():

        vote_info[str(row[vote_id_key])] = {
            'date': row[date],
            'title': row[title],
            'policy_area': row[policy_area],
            'author': author,
        }
    
    # Create a list to store results
    results = []
    
    # Process each MEP's votes
    for _, mep_row in rcv_data.iterrows():
        country = mep_row['Country']
        party = mep_row['Party']
        epg = mep_row['EPG']

        first_name = mep_row['Fname']
        last_name = mep_row['Lname']
        
        mep_id = mep_row[mep_id_key]
    
        # Process each vote for this MEP
        for vote_col in vote_columns:
            
            vote_col = str(vote_col)
            vote_code = f'{ep_session}-{vote_col}' 
            
            if vote_col not in vote_info:
                total_skipped += 1
                continue
            
            try:
                mep_vote = mep_row[str(vote_col)]
            except Exception as e:
                mep_vote = mep_row[int(vote_col)]
            
            if mep_vote == 0:
                continue
                
            info = vote_info[vote_col]
            
            results.append({
                'full name': clean_name(first_name, last_name),
                'country': country,
                'national_party': party,
                'epg': epg,
                'mep_id': mep_id,
                'vote_code': vote_code,
                'vote': mep_vote,
                'date': info['date'],
                'title': info['title'],
                'policy_area': clean_text(info['policy_area']),
            })
    
    print(f"Were not able to match: {total_skipped} votes")
    return pd.DataFrame(results)


In [20]:
import pandas as pd
import numpy as np

def parse_mixed_dates(date_str):
    
    if not isinstance(date_str, str):
        date_str = str(date_str)
    
    date_str = date_str.strip()
    
    try:
        # Check for dates with time component
        if " 00:00:00" in date_str:
            date_str = date_str.replace(" 00:00:00", "")

        if date_str == '18 ian 2007':
            return pd.to_datetime('2007-01-18')
        
        if '.' in date_str:
            # Parse as dd.mm.yyyy
            return pd.to_datetime(date_str, format='%d.%m.%Y')
        elif '-' in date_str:
            # Parse as yyyy-mm-dd
            return pd.to_datetime(date_str, format='%Y-%m-%d')
        elif '/' in date_str:
            try:
                # First try %d/%m/%Y (day/month/4-digit year)
                return pd.to_datetime(date_str, format='%d/%m/%Y')
            except ValueError:
                try:
                    # Then try %d/%m/%y (day/month/2-digit year)
                    return pd.to_datetime(date_str, format='%d/%m/%y')
                except ValueError:
                    # Fall back to pandas default parser with dayfirst=True
                    return pd.to_datetime(date_str, dayfirst=True)
        else:
            # For formats we haven't explicitly handled, use pandas' flexible parser
            return pd.to_datetime(date_str)
            
    except Exception as e:
        print(f"Error parsing date '{date_str}': {e}")
        return pd.NaT  # In case of error
    

def clean_combined_data(df):

    # Standardize date and extract year and month
    df['date'] = df['date'].astype(str).apply(parse_mixed_dates)
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month

    # Map EPG to ture names:
    epg_mapping = {
        "Group of the European People's Party (Christian Democrats)": 'EPP',
        "Group of the European People's Party (Christian Democrats) and European Democrats": 'EPP',
        'Socialist Group in the European Parliament': 'S&D',
        'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament': 'S&D',
        'Confederal Group of the European United Left - Nordic Green Left': 'The Left',
        'Group of the Greens/European Free Alliance': 'Greens/EFA',
        'Independence/Democracy Group': 'IDG',
        'Europe of freedom and democracy Group': 'IDG',
        'Europe of Freedom and Direct Democracy Group': 'IDG',
        'Europe of Nations and Freedom Group': 'ID',
        'European Conservatives and Reformists Group': 'ECR',
        'Non-attached Members': 'NI',
        'Group of the Alliance of Liberals and Democrats for Europe' : 'REG'
    }
    df['epg'] = df['epg'].replace(epg_mapping)


    # Merge PA 
    pa_mapping = {
        'regioanal development': 'regional development',
        'economic monetary affairs': 'economics',
        'juridical affairs': 'legal affairs'
    }
    df['policy_area'] = df['policy_area'].replace(pa_mapping)

    # Data type transformations
    df['epg'] = df['epg'].astype(str)
    df['year'] = df['year'].astype(float).astype(int)   
    df['vote'] = df['vote'].astype(float).astype(int)   
    
    return df

In [3]:

voted_docs_files = ["VoteWatch-EP-voting-data_2004-2022/EP6_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP7_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP8_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP9_Voted docs.xlsx"]
rcv_files = ["VoteWatch-EP-voting-data_2004-2022/EP6_RCVs_2022_06_13.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP7_RCVs_2014_06_19.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP8_RCVs_2019_06_25.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP9_RCVs_2022_06_22.xlsx"]

df = process_ep_voting_data(rcv_files, voted_docs_files)

print('Saving uncleaned data')

output_file = "ep_voting_data_combined_raw.csv"
df.to_csv(output_file, index=False)

print('Cleaning data')
df = clean_combined_data(df)

print('Saving cleaned data')

output_file = "ep_voting_data_combined_clean.csv"
df.to_csv(output_file, index=False)


Processing files 1/4: VoteWatch-EP-voting-data_2004-2022/EP6_RCVs_2022_06_13.xlsx and VoteWatch-EP-voting-data_2004-2022/EP6_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 5827060
Got: 4759840
Processing files 2/4: VoteWatch-EP-voting-data_2004-2022/EP7_RCVs_2014_06_19.xlsx and VoteWatch-EP-voting-data_2004-2022/EP7_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 5937733
Got: 5233859
Processing files 3/4: VoteWatch-EP-voting-data_2004-2022/EP8_RCVs_2019_06_25.xlsx and VoteWatch-EP-voting-data_2004-2022/EP8_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 8796216
Got: 7696506
Processing files 4/4: VoteWatch-EP-voting-data_2004-2022/EP9_RCVs_2022_06_22.xlsx and VoteWatch-EP-voting-data_2004-2022/EP9_Voted docs.xlsx


  warn(msg)


Were not able to match: 0 votes
Should be around: 10915249
Got: 9520348
Saving uncleaned data
Cleaning data


In [190]:
df['year'].value_counts()

year
2021    3901690
2020    3606332
2018    1760795
2015    1681323
2016    1631643
2022    1624405
2019    1551473
2013    1490385
2014    1357096
2017    1247199
2008    1223553
2009    1059693
2007    1017630
2011     828518
2005     782272
2012     775963
2010     770885
2006     746226
2004     153472
Name: count, dtype: int64

In [67]:
import pandas as pd

try:
    headers_list = df.columns.tolist()
except NameError:
    df = pd.read_csv('ep_voting_data_combined_clean.csv', dtype=str)
    headers_list = df.columns.tolist()

print("Column headers:")
print(headers_list)


Column headers:
['full name', 'country', 'national_party', 'epg', 'mep_id', 'vote_code', 'vote', 'date', 'title', 'policy_area', 'ep_session', 'year', 'month', 'policy_area_cleaned']


In [231]:
from bokeh.models import ColumnDataSource, HTMLTemplateFormatter, TableColumn
from bokeh.models.widgets import DataTable
from bokeh.io import output_file, save
import pandas as pd

# --- EPG Info ---
epg_info = {
    'NI': {
        'name': 'Non-Attached Members', 
        'color': '#808080',
        'ideology': 'Mixed/Unaligned', 
        'abbr': 'NI'
    },
    'The Left': {
        'name': 'The Left', 
        'color': '#B71C1C',
        'ideology': 'Left-wing to Far-left', 
        'abbr': 'The Left'
    },
    'S&D': {
        'name': 'Progressive Alliance of Socialists and Democrats', 
        'color': '#D32F2F',
        'ideology': 'Center-left to Left-wing', 
        'abbr': 'S&D'
    },
    'IDG': {
        'name': 'Identity and Democracy Group', 
        'color': '#1A237E',
        'ideology': 'Right-wing to Far-right', 
        'abbr': 'IDG'
    },
    'Greens/EFA': {
        'name': 'Greens/European Free Alliance', 
        'color': '#43A047',
        'ideology': 'Green politics, Regionalist', 
        'abbr': 'Greens/EFA'
    },
    'REG': {
        'name': 'Renew Europe Group', 
        'color': '#039BE5',
        'ideology': 'Centrist, Liberal', 
        'abbr': 'REG'
    },
    'EPP': {
        'name': 'European People\'s Party', 
        'color': '#0D47A1',
        'ideology': 'Center-right to Right-wing', 
        'abbr': 'EPP'
    }
}

# --- Filter and count MEPs ---
df_2015 = df[(df['year'] == 2015) & (df['epg'].isin(epg_info.keys()))]
unique_meps_2015 = df_2015[['mep_id', 'epg']].drop_duplicates()
total_meps = unique_meps_2015['mep_id'].nunique()
mep_counts = unique_meps_2015.groupby('epg')['mep_id'].nunique().to_dict()

# --- Build table data ---
rows = []
for epg, info in epg_info.items():
    count = mep_counts.get(epg, 0)
    percent = (count / total_meps * 100) if total_meps > 0 else 0
    rows.append({
        'Color': info['color'],
        'ColorBox': f"<div style='width:20px; height:20px; background-color:{info['color']}; border:1px solid #000;'></div>",
        'Abbreviation': info['abbr'],
        'Full Name': f"<span style='color:{info['color']}; font-weight:bold;'>{info['name']}</span>",
        'Ideology': info['ideology'],
        'MEPs (2015)': f"{percent:.2f}%"
    })

df_table = pd.DataFrame(rows).sort_values('Color')
source = ColumnDataSource(df_table)

# --- Define Columns with HTML formatting ---
columns = [
    TableColumn(field='ColorBox', title='', formatter=HTMLTemplateFormatter(template='<%= value %>')),
    TableColumn(field='Abbreviation', title='Abbr.'),
    TableColumn(field='Full Name', title='Full Name', formatter=HTMLTemplateFormatter(template='<%= value %>')),
    TableColumn(field='Ideology', title='Ideology'),
    TableColumn(field='MEPs (2015)', title='MEPs (2015)')
]

# --- Create Bokeh DataTable ---
data_table = DataTable(source=source, columns=columns, width=1000, height=280, index_position=None)

# --- Output HTML ---
output_file("epg_table_bokeh.html")
save(data_table)

print("✅ Saved: plots/epg_table_bokeh.html")


✅ Saved: epg_table_bokeh.html


In [198]:
df = df[df['year'] != (2004 or 2022)]

## Create simmilarity matrix between EPGs for each year and policy area

In [199]:
from itertools import combinations

def agreement_index(votes1, votes2):

    total_votes_1 = sum(votes1)
    total_votes_2 = sum(votes2)
    
    total_difference = 0
    
    for i in range(3): #Consider 3 vote types, 1 (yes), 2 (no), 3 (abstain)

        pct1 = votes1[i] / total_votes_1
        pct2 = votes2[i] / total_votes_2
        
        # Add minimum percentage as the agreement for this vote type
        total_difference += abs(pct1 - pct2)
    
    return 1 - (total_difference / 2)

def agreement_index(votes1, votes2):
    """
    Rice index
    """

    total_votes = sum(votes1) + sum(votes2)
    
    total_yes = votes1[0] + votes2[0]
    total_no = votes1[1] + votes2[1]
    
    
    return abs(total_yes - total_no) / total_votes



In [200]:

# Filter out EPG that appear in all years
years = df['year'].unique()
epgs_by_year = [set(df[df['year'] == year]['epg'].dropna().unique()) for year in years]
common_epgs = set.intersection(*epgs_by_year)


# Modified code to use the new similarity function
df_year = df.groupby(['year', 'policy_area'])
epgs = sorted(list(common_epgs))
similarity_matrices = {}


# Track missing data
missing_data_count = 0
total_combinations = 0

for name, group in df_year:
    year, policy_area = name 
    year = int(year)
    sim_matrix = pd.DataFrame(index=epgs, columns=epgs)
    
    # Track missing data for this year/policy_area
    missing_in_current = 0
    total_in_current = 0
    
    # Calculate similarities between all EPG pairs
    for epg1, epg2 in combinations(epgs, 2):
        total_combinations += 1
        total_in_current += 1
        
        ep1_votes_series = group[group['epg'] == epg1]['vote'].value_counts()
        ep2_votes_series = group[group['epg'] == epg2]['vote'].value_counts()

        # Initialize arrays with zeros for vote types 1, 2, and 3
        ep1_votes = [0, 0, 0]  # Index 0 for vote value 1, index 1 for vote value 2, etc.
        ep2_votes = [0, 0, 0]

        # Fill in the counts from the Series, handling both int and float vote types
        for vote_type, count in ep1_votes_series.items():
            vote_index = int(float(vote_type)) - 1  # Convert vote type (1,2,3) to index (0,1,2)
            if 0 <= vote_index <= 2:  # Only include vote types 1, 2, and 3
                ep1_votes[vote_index] = count

        for vote_type, count in ep2_votes_series.items():
            vote_index = int(float(vote_type)) - 1
            if 0 <= vote_index <= 2:
                ep2_votes[vote_index] = count

        
        # Check if any relevant votes exist for both EPGs
        has_votes_1 = sum(ep1_votes) != 0
        has_votes_2 = sum(ep2_votes) != 0
        
        if not has_votes_1 or not has_votes_2:
            missing_data_count += 1
            missing_in_current += 1
            similarity = 0  # No relevant votes, similarity is 0
        else:
            similarity = agreement_index(ep1_votes, ep2_votes)
            
        sim_matrix.loc[epg1, epg2] = similarity
        sim_matrix.loc[epg2, epg1] = similarity  # Matrix is symmetric
    
    # Set diagonal to 1 (self-similarity)
    for epg in epgs:
        sim_matrix.loc[epg, epg] = 1.0
            
    # Store the matrix
    if policy_area not in similarity_matrices:
        similarity_matrices[policy_area] = {}
    similarity_matrices[policy_area][year] = sim_matrix
    
    # Print summary for this year/policy_area
    if missing_in_current > 0:
        print(f"Year {year}, Policy Area '{policy_area}': {missing_in_current}/{total_in_current} EPG pairs have missing vote data ({missing_in_current/total_in_current:.1%})")

# Print overall summary
print(f"\nOverall: {missing_data_count}/{total_combinations} EPG pairs have missing vote data ({missing_data_count/total_combinations:.1%})")


Overall: 0/7791 EPG pairs have missing vote data (0.0%)


# Analyse sim matrix

In [201]:
import pandas as pd
import numpy as np
from itertools import combinations

# Function to find most and least related EPG pairs from a similarity matrix
def get_most_least_related(similarity_matrix):
    # Create a copy to avoid modifying the original
    sim_mat = similarity_matrix.copy()
    
    # Exclude NI group if it exists in the matrix
    if 'NI' in sim_mat.index:
        sim_mat = sim_mat.drop('NI', axis=0)
    if 'NI' in sim_mat.columns:
        sim_mat = sim_mat.drop('NI', axis=1)
    
    # If matrix is empty after dropping NI, return empty results
    if sim_mat.empty:
        return {
            'most_related': {'pairs': [], 'similarity': np.nan},
            'least_related': {'pairs': [], 'similarity': np.nan}
        }
    
    # Set diagonal to NaN to exclude self-relationships
    np.fill_diagonal(sim_mat.values, np.nan)
    
    # Find max similarity (most related)
    max_val = sim_mat.max().max()
    max_idx = np.where(sim_mat == max_val)
    max_pairs = list(zip(sim_mat.index[max_idx[0]], sim_mat.columns[max_idx[1]]))
    
    # Find min similarity (least related)
    min_val = sim_mat.min().min()
    min_idx = np.where(sim_mat == min_val)
    min_pairs = list(zip(sim_mat.index[min_idx[0]], sim_mat.columns[min_idx[1]]))
    
    # Remove duplicates due to symmetry
    max_pairs = [tuple(sorted(pair)) for pair in max_pairs]
    min_pairs = [tuple(sorted(pair)) for pair in min_pairs]
    max_pairs = list(dict.fromkeys(max_pairs))
    min_pairs = list(dict.fromkeys(min_pairs))
    
    return {
        'most_related': {'pairs': max_pairs, 'similarity': max_val},
        'least_related': {'pairs': min_pairs, 'similarity': min_val}
    }

# Create a dataframe to store results
policy_results = []

# Dictionaries to store yearly aggregated matrices across policy areas
yearly_matrices = {}
all_years = set()

# Analyze each policy area
for policy_area, years_data in similarity_matrices.items():
    # First analyze each year individually
    for year, sim_matrix in sorted(years_data.items()):
        # Convert year to int if it's not already (to ensure consistent handling)
        year_val = int(year) if not isinstance(year, int) else year
        
        # Track all years for later use
        all_years.add(year_val)
        
        # Add to yearly aggregated matrices
        if year_val not in yearly_matrices:
            yearly_matrices[year_val] = sim_matrix.copy()
        else:
            yearly_matrices[year_val] += sim_matrix
        
        relations = get_most_least_related(sim_matrix)
        
        # Skip if no valid relations (e.g., only NI group was present)
        if np.isnan(relations['most_related']['similarity']):
            continue
        
        # Format the results
        most_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']])
        least_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']])
        
        policy_results.append({
            'Policy Area': policy_area,
            'Year': str(year_val),  # Convert to string for consistent handling
            'Most Related EPGs': most_related_pairs,
            'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
            'Least Related EPGs': least_related_pairs,
            'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
        })
    
    # Calculate the average similarity matrix across all years for this policy area
    avg_matrix = None
    years_count = 0
    
    for year, matrix in years_data.items():
        if avg_matrix is None:
            avg_matrix = matrix.copy()
        else:
            avg_matrix += matrix
        years_count += 1
    
    if avg_matrix is not None and years_count > 0:
        avg_matrix = avg_matrix / years_count
        
        # Get the most and least related from the average matrix
        relations = get_most_least_related(avg_matrix)
        
        # Skip if no valid relations
        if np.isnan(relations['most_related']['similarity']):
            continue
        
        # Format the results
        most_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']])
        least_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']])
        
        policy_results.append({
            'Policy Area': policy_area,
            'Year': 'TOTAL',
            'Most Related EPGs': most_related_pairs,
            'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
            'Least Related EPGs': least_related_pairs,
            'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
        })

# Calculate the average for each year across all policy areas
yearly_results = []
policy_areas_count = len(similarity_matrices)

for year in sorted(yearly_matrices.keys()):
    yearly_avg_matrix = yearly_matrices[year] / policy_areas_count
    relations = get_most_least_related(yearly_avg_matrix)
    
    # Skip if no valid relations
    if np.isnan(relations['most_related']['similarity']):
        continue
    
    # Format the results
    most_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']])
    least_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']])
    
    yearly_results.append({
        'Year': str(year),  # Convert to string
        'Most Related EPGs': most_related_pairs,
        'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
        'Least Related EPGs': least_related_pairs,
        'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
    })

# Calculate overall total across all years and policy areas
overall_matrix = None
matrix_count = 0

for policy_area, years_data in similarity_matrices.items():
    for year, matrix in years_data.items():
        if overall_matrix is None:
            overall_matrix = matrix.copy()
        else:
            overall_matrix += matrix
        matrix_count += 1

if overall_matrix is not None and matrix_count > 0:
    overall_matrix = overall_matrix / matrix_count
    
    # Get the most and least related from the overall matrix
    relations = get_most_least_related(overall_matrix)
    
    # Add to yearly results as a total row
    if not np.isnan(relations['most_related']['similarity']):
        yearly_results.append({
            'Year': 'TOTAL',
            'Most Related EPGs': ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']]),
            'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
            'Least Related EPGs': ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']]),
            'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
        })

# Convert to dataframes
policy_df = pd.DataFrame(policy_results)
yearly_df = pd.DataFrame(yearly_results)

# Create a custom sorting order to ensure TOTAL is at the end
def custom_sort(df, col):
    # Extract the non-TOTAL values and sort them numerically
    non_total = [x for x in df[col].unique() if x != 'TOTAL']
    # Convert to ints for proper numerical sorting, then back to strings
    sorted_non_total = [str(x) for x in sorted([int(x) for x in non_total])]
    # Create the full sort order with TOTAL at the end
    sort_order = {val: i for i, val in enumerate(sorted_non_total + ['TOTAL'])}
    # Apply the sort
    return df.sort_values(by=col, key=lambda x: x.map(sort_order))

# Apply custom sorting
if not policy_df.empty:
    policy_df = policy_df.sort_values('Policy Area')
    policy_df = custom_sort(policy_df, 'Year')

if not yearly_df.empty:
    yearly_df = custom_sort(yearly_df, 'Year')

# Display the yearly aggregated results
print("Yearly EPG Relationship Analysis (All Policy Areas Combined)")
print("=" * 100)
print("Note: NI (Non-Attached Members) excluded from analysis")
print("-" * 100)
if not yearly_df.empty:
    print(yearly_df.to_string(index=False))
else:
    print("No data available after excluding NI group")
print("\n\n")

# Display the policy-specific results
print("EPG Relationship Analysis by Policy Area and Year")
print("=" * 100)
print("Note: NI (Non-Attached Members) excluded from analysis")
print("-" * 100)

if not policy_df.empty:
    for policy_area, group in policy_df.groupby('Policy Area'):
        print(f"\nPolicy Area: {policy_area}")
        print("-" * 100)
        
        # Apply custom sorting within each group to ensure TOTAL is at the end
        group = custom_sort(group, 'Year')
        
        # Format for display
        display_df = group[['Year', 'Most Related EPGs', 'Most Related Similarity', 
                            'Least Related EPGs', 'Least Related Similarity']]
        
        # Convert to string representation with aligned columns
        print(display_df.to_string(index=False))
else:
    print("No data available after excluding NI group")

Yearly EPG Relationship Analysis (All Policy Areas Combined)
Note: NI (Non-Attached Members) excluded from analysis
----------------------------------------------------------------------------------------------------
 Year   Most Related EPGs Most Related Similarity Least Related EPGs Least Related Similarity
 2005 Greens/EFA-The Left                   0.341       IDG-The Left                    0.196
 2006 Greens/EFA-The Left                   0.407       IDG-The Left                    0.167
 2007 Greens/EFA-The Left                   0.288       IDG-The Left                    0.161
 2008      Greens/EFA-REG                   0.448       IDG-The Left                    0.273
 2009      Greens/EFA-S&D                   0.554       IDG-The Left                    0.283
 2010      Greens/EFA-S&D                   0.463       IDG-The Left                    0.263
 2011             REG-S&D                   0.494       IDG-The Left                    0.177
 2012             REG-S&D      

## Create yes vote percentage matrix over EPGs and policy area for each year

In [202]:
# Step 1: Filter votes with codes 1, 2, 3
df_relevant_votes = df[df['vote'].isin([1, 2, 3])]

# Step 2: Get EPGs present in all years
years = df_relevant_votes['year'].unique()
epgs_by_year = [set(df_relevant_votes[df_relevant_votes['year'] == year]['epg'].dropna().unique()) for year in years]
common_epgs = set.intersection(*epgs_by_year)
epgs = sorted(list(common_epgs))

# Step 3: Get policy areas present in all years
policy_areas_by_year = [set(df_relevant_votes[df_relevant_votes['year'] == year]['policy_area'].dropna().unique()) for year in years]
common_policy_areas = set.intersection(*policy_areas_by_year)
policy_areas = sorted(list(common_policy_areas))

# Step 4: Filter the DataFrame using both EPG and Policy Area
df_filtered = df_relevant_votes[
    (df_relevant_votes['epg'].isin(common_epgs)) &
    (df_relevant_votes['policy_area'].isin(common_policy_areas))
]

# Step 5: Group by year and calculate % of 'yes' (vote==1) per (EPG, policy_area)
df_year = df_filtered.groupby('year')

yes_percentage_matrices = {}

for year, group in df_year:
    yes_percentage_matrix = pd.DataFrame(index=['combined'] + epgs, columns=['combined'] + policy_areas, dtype=float)

    # Combined EPG (i.e. 'combined' row): average for all MEPs by policy area
    for policy_area in policy_areas:
        subset = group[group['policy_area'] == policy_area]
        votes_series = subset['vote'].value_counts()
        yes_votes = votes_series.get(1, 0)
        total_votes = votes_series.sum()
        percentage = yes_votes / total_votes if total_votes > 0 else np.nan
        yes_percentage_matrix.loc['combined', policy_area] = percentage

    # Combined EPG + Combined Policy Area (bottom-right cell)
    subset = group
    votes_series = subset['vote'].value_counts()
    yes_votes = votes_series.get(1, 0)
    total_votes = votes_series.sum()
    percentage = yes_votes / total_votes if total_votes > 0 else np.nan
    yes_percentage_matrix.loc['combined', 'combined'] = percentage

    # For each EPG
    for epg in epgs:
        for policy_area in policy_areas:
            subset = group[(group['epg'] == epg) & (group['policy_area'] == policy_area)]
            votes_series = subset['vote'].value_counts()
            yes_votes = votes_series.get(1, 0)
            total_votes = votes_series.sum()
            percentage = yes_votes / total_votes if total_votes > 0 else np.nan
            yes_percentage_matrix.loc[epg, policy_area] = percentage

        # EPG row's 'combined' column
        subset = group[group['epg'] == epg]
        votes_series = subset['vote'].value_counts()
        yes_votes = votes_series.get(1, 0)
        total_votes = votes_series.sum()
        percentage = yes_votes / total_votes if total_votes > 0 else np.nan
        yes_percentage_matrix.loc[epg, 'combined'] = percentage

    yes_percentage_matrices[year] = yes_percentage_matrix


# Epic heatmap with EPG vs Policy Area over time

In [248]:
import pandas as pd
import numpy as np
from bokeh.plotting import figure, output_file, save
from bokeh.layouts import column, row
from bokeh.models import (ColumnDataSource, LinearColorMapper, ColorBar, 
                          HoverTool, Slider, CheckboxGroup, CustomJS, 
                          BasicTicker, PrintfTickFormatter, Button, 
                          ColumnDataSource, Toggle)
from bokeh.palettes import RdBu11
from bokeh.io import output_notebook, show

# Create a combined matrix for all years
# First, get all the votes and total counts for each cell
total_yes_counts = {}
total_vote_counts = {}

# Initialize the matrix with the same structure as the year matrices
first_year = list(yes_percentage_matrices.keys())[0]
combined_matrix = pd.DataFrame(index=yes_percentage_matrices[first_year].index,
                               columns=yes_percentage_matrices[first_year].columns,
                               dtype=float)

# For each year, extract the raw vote counts to get accurate aggregations
for year, df_group in df_year:
    for epg in epgs + ['combined']:
        for policy_area in policy_areas + ['combined']:
            # Key for tracking totals
            cell_key = (epg, policy_area)
            
            # Get the appropriate subset of data
            if epg == 'combined' and policy_area == 'combined':
                subset = df_group
            elif epg == 'combined':
                subset = df_group[df_group['policy_area'] == policy_area]
            elif policy_area == 'combined':
                subset = df_group[df_group['epg'] == epg]
            else:
                subset = df_group[(df_group['epg'] == epg) & (df_group['policy_area'] == policy_area)]
            
            # Count votes
            votes_series = subset['vote'].value_counts()
            yes_votes = votes_series.get(1, 0)
            total_votes = votes_series.sum()
            
            # Add to running totals
            if cell_key not in total_yes_counts:
                total_yes_counts[cell_key] = 0
                total_vote_counts[cell_key] = 0
            
            total_yes_counts[cell_key] += yes_votes
            total_vote_counts[cell_key] += total_votes

# Calculate percentages for the combined matrix
for epg in epgs + ['combined']:
    for policy_area in policy_areas + ['combined']:
        cell_key = (epg, policy_area)
        if total_vote_counts[cell_key] > 0:
            combined_matrix.loc[epg, policy_area] = total_yes_counts[cell_key] / total_vote_counts[cell_key]
        else:
            combined_matrix.loc[epg, policy_area] = np.nan

# Create a separate total matrix without 'combined' row and column
total_matrix = combined_matrix.copy()
if 'combined' in total_matrix.index:
    total_matrix = total_matrix.drop('combined', axis=0)
if 'combined' in total_matrix.columns:
    total_matrix = total_matrix.drop('combined', axis=1)

# Add the matrices to our dictionary
yes_percentage_matrices['all_years'] = combined_matrix
yes_percentage_matrices['total'] = total_matrix


In [249]:
print(common_epgs)

epg_info = {
    'NI': {
        'name': 'Non-Attached Members', 
        'color': '#808080',
        'ideology': 'Mixed/Unaligned', 
        'abbr': 'NI'
    },
    'The Left': {
        'name': 'The Left', 
        'color': '#B71C1C',
        'ideology': 'Left-wing to Far-left', 
        'abbr': 'The Left'
    },
    'S&D': {
        'name': 'Progressive Alliance of Socialists and Democrats', 
        'color': '#D32F2F',
        'ideology': 'Center-left to Left-wing', 
        'abbr': 'S&D'
    },
    'IDG': {
        'name': 'Identity and Democracy Group', 
        'color': '#1A237E',
        'ideology': 'Right-wing to Far-right', 
        'abbr': 'IDG'
    },
    'Greens/EFA': {
        'name': 'Greens/European Free Alliance', 
        'color': '#43A047',
        'ideology': 'Green politics, Regionalist', 
        'abbr': 'Greens/EFA'
    },
    'REG': {
        'name': 'Renew Europe Group', 
        'color': '#039BE5',
        'ideology': 'Centrist, Liberal', 
        'abbr': 'REG'
    },
    'EPP': {
        'name': 'European People\'s Party', 
        'color': '#0D47A1',
        'ideology': 'Center-right to Right-wing', 
        'abbr': 'EPP'
    }
}

{'NI', 'The Left', 'S&D', 'IDG', 'Greens/EFA', 'REG', 'EPP'}


Yearly heatmap saved as 'ep_voting_yearly_heatmap.html'
Aggregated heatmap saved as 'ep_voting_aggregated_heatmap.html'


## Create PCA, tranform with procrustes and animate over time for paramter policy area

## Aggregated over all policy areas

In [250]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
from bokeh.plotting import figure, save, output_file
from bokeh.models import ColumnDataSource, LabelSet, Slider, CustomJS, Button, Range1d, Legend, LegendItem
from bokeh.layouts import column, row

# Set output to an HTML file
output_file("plots/epg_clustering_all_policies_merged.html")

# Define EPG color scheme
epg_info = {
    'NI': {
        'name': 'Non-Attached Members', 
        'color': '#808080',
        'ideology': 'Mixed/Unaligned', 
        'abbr': 'NI'
    },
    'The Left': {
        'name': 'The Left', 
        'color': '#B71C1C',
        'ideology': 'Left-wing to Far-left', 
        'abbr': 'The Left'
    },
    'S&D': {
        'name': 'Progressive Alliance of Socialists and Democrats', 
        'color': '#D32F2F',
        'ideology': 'Center-left to Left-wing', 
        'abbr': 'S&D'
    },
    'IDG': {
        'name': 'Identity and Democracy Group', 
        'color': '#1A237E',
        'ideology': 'Right-wing to Far-right', 
        'abbr': 'IDG'
    },
    'Greens/EFA': {
        'name': 'Greens/European Free Alliance', 
        'color': '#43A047',
        'ideology': 'Green politics, Regionalist', 
        'abbr': 'Greens/EFA'
    },
    'REG': {
        'name': 'Renew Europe Group', 
        'color': '#039BE5',
        'ideology': 'Centrist, Liberal', 
        'abbr': 'REG'
    },
    'EPP': {
        'name': 'European People\'s Party', 
        'color': '#0D47A1',
        'ideology': 'Center-right to Right-wing', 
        'abbr': 'EPP'
    }
}

# Merge data from all policy areas for each year
merged_similarity_matrices = {}

# Get all unique years across all policy areas
all_years = set()
for policy_area in similarity_matrices:
    all_years.update(similarity_matrices[policy_area].keys())
all_years = sorted(all_years)

# Get all unique EPGs
all_epgs = set()
for policy_area in similarity_matrices:
    for year in similarity_matrices[policy_area]:
        all_epgs.update(similarity_matrices[policy_area][year].index)
all_epgs = sorted(list(all_epgs))

# Merge data for each year by averaging similarity scores across policy areas
for year in all_years:
    # Create a DataFrame with zeros for all EPG pairs - explicitly use float dtype
    merged_matrix = pd.DataFrame(0.0, index=all_epgs, columns=all_epgs, dtype=float)
    count_matrix = pd.DataFrame(0, index=all_epgs, columns=all_epgs, dtype=int)
    
    # Add similarity scores from each policy area
    for policy_area in similarity_matrices:
        if year in similarity_matrices[policy_area]:
            policy_matrix = similarity_matrices[policy_area][year]
            for epg1 in policy_matrix.index:
                for epg2 in policy_matrix.columns:
                    if epg1 in merged_matrix.index and epg2 in merged_matrix.columns:
                        # Explicitly convert to float to avoid dtype issues
                        value = float(policy_matrix.loc[epg1, epg2])
                        merged_matrix.loc[epg1, epg2] += value
                        count_matrix.loc[epg1, epg2] += 1
    
    # Average the similarity scores
    for epg1 in merged_matrix.index:
        for epg2 in merged_matrix.columns:
            if count_matrix.loc[epg1, epg2] > 0:
                merged_matrix.loc[epg1, epg2] /= float(count_matrix.loc[epg1, epg2])
            else:
                # No data for this pair, set to 0 if different EPGs, 1 if same EPG
                merged_matrix.loc[epg1, epg2] = 1.0 if epg1 == epg2 else 0.0
    
    # Store the merged matrix
    merged_similarity_matrices[year] = merged_matrix

# Get years with enough data
years = [year for year in all_years if len(merged_similarity_matrices[year]) >= 3]

# Make sure we have at least one valid year
if not years:
    raise ValueError("No years have enough EPGs for visualization")

# Get EPGs from the first year
epgs = list(merged_similarity_matrices[years[0]].index)

# Function to get color for an EPG (using the predefined colors or default to gray)
def get_epg_color(epg):
    if epg in epg_info:
        return epg_info[epg]['color']
    return '#CCCCCC'  # Default gray for unknown EPGs

# Function to get full name for an EPG
def get_epg_full_name(epg):
    if epg in epg_info:
        return epg_info[epg]['name']
    return epg  # Default to the abbreviation if not found

# Function to get ideology for an EPG
def get_epg_ideology(epg):
    if epg in epg_info:
        return epg_info[epg]['ideology']
    return 'Unknown'  # Default to Unknown if not found

# Function to perform dimensionality reduction on a similarity matrix
def get_coordinates(similarity_matrix, method='pca'):
    # Convert similarity to distance matrix
    distance_matrix = 1 - similarity_matrix
    
    # Convert to numpy array
    X = distance_matrix.values.astype(float)  # Ensure float data type
    
    # Apply dimensionality reduction
    if method == 'pca':
        model = PCA(n_components=2)
        result = model.fit_transform(X)
    else:
        raise ValueError(f"Unknown method: {method}")
    
    # Create DataFrame with results - use explicit dtypes
    df_result = pd.DataFrame({
        'x': result[:, 0].astype(float),
        'y': result[:, 1].astype(float),
        'epg': distance_matrix.index.tolist(),
        'color': [get_epg_color(epg) for epg in distance_matrix.index],
        'full_name': [get_epg_full_name(epg) for epg in distance_matrix.index],
        'ideology': [get_epg_ideology(epg) for epg in distance_matrix.index]
    })
    
    return df_result

# Function to align coordinates with reference using Procrustes analysis
def align_coordinates(target_df, reference_df):
    # Get common EPGs
    common_epgs = set(target_df['epg']).intersection(set(reference_df['epg']))
    
    if len(common_epgs) < 2:
        # Not enough common points to align
        return target_df
    
    # Extract coordinates for common EPGs
    target_coords = np.array([target_df[target_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
    reference_coords = np.array([reference_df[reference_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
    
    # Perform Procrustes analysis to align target to reference
    mtx1, mtx2, disparity = procrustes(reference_coords, target_coords)
    
    # Create transformation matrix (scale, rotation, reflection)
    scale = np.sqrt(np.sum(mtx1[0]**2)) / np.sqrt(np.sum(target_coords[0]**2))
    
    # Apply transformation to all points in target_df
    result_df = target_df.copy()
    coords = result_df[['x', 'y']].values.astype(float)  # Ensure float data type
    
    # Scale and center (simplified Procrustes)
    coords_scaled = coords * scale
    
    # Get centroids
    target_centroid = np.mean(target_coords, axis=0)
    reference_centroid = np.mean(reference_coords, axis=0)
    
    # Translate
    coords_transformed = coords_scaled - target_centroid + reference_centroid
    
    # Update dataframe
    result_df['x'] = coords_transformed[:, 0]
    result_df['y'] = coords_transformed[:, 1]
    
    return result_df

# Compute coordinates for the first year (reference)
method = 'pca'  # PCA is more stable for small numbers of points
reference_data = get_coordinates(merged_similarity_matrices[years[0]], method=method)

# Compute and align coordinates for all years
aligned_data = {}
for year in years:
    try:
        # Compute initial coordinates
        year_data = get_coordinates(merged_similarity_matrices[year], method=method)
        
        # Align with reference
        if year == years[0]:
            aligned_data[year] = year_data  # Reference year doesn't need alignment
        else:
            aligned_data[year] = align_coordinates(year_data, reference_data)
    except Exception as e:
        print(f"Error processing year {year}: {e}")
        # Skip this year

# Make sure we have at least one valid year after processing
if not aligned_data:
    raise ValueError("No valid data after processing")

# Find the overall range of data across all years to set consistent plot boundaries
all_x = []
all_y = []
for year_data in aligned_data.values():
    all_x.extend(year_data['x'].tolist())  # Convert to list
    all_y.extend(year_data['y'].tolist())  # Convert to list

x_min, x_max = min(all_x), max(all_x)
y_min, y_max = min(all_y), max(all_y)

# Add padding (200% zoom - 50% padding on each side)
padding_x = (x_max - x_min) * 0.5
padding_y = (y_max - y_min) * 0.5
x_range = (float(x_min - padding_x), float(x_max + padding_x))  # Ensure float type
y_range = (float(y_min - padding_y), float(y_max + padding_y))  # Ensure float type

# Create initial plot with first year
current_year = years[0]
init_data = aligned_data[current_year]

# Create ColumnDataSource
source = ColumnDataSource(init_data)

# Create the figure with fixed range
p = figure(width=800, height=600, 
           title=f'EPG Clustering - All Policy Areas Combined ({current_year})',
           tools="pan,wheel_zoom,box_zoom,reset,save,hover",
           x_range=Range1d(x_range[0], x_range[1]),
           y_range=Range1d(y_range[0], y_range[1]),
           tooltips=[
               ("Group", "@full_name (@epg)"),
               ("Ideology", "@ideology")
           ])

# Add scatter points
circles = p.circle('x', 'y', source=source, size=15, color='color', alpha=0.8, 
                 line_color='black', line_width=1)

# Add labels
#labels = LabelSet(x='x', y='y', text='epg', source=source,
#                 text_font_size='10pt', text_color='black',
#                 x_offset=8, y_offset=-8)  # Changed to y_offset=-5 to position labels below points
#p.add_layout(labels)

# Create a legend to explain the colors
legend_items = []
for epg, info in epg_info.items():
    if epg in init_data['epg'].values:
        legend_items.append((f"{epg} - {info['name']}", [p.circle(x=0, y=0, color=info['color'], size=10, visible=False)]))

legend = Legend(items=legend_items, location="top_left")
legend.click_policy = "hide"  # Make the legend interactive
p.add_layout(legend)

# Prepare data for JavaScript
js_data = {}
for year in aligned_data.keys():
    # Convert DataFrame to dictionary safely
    year_dict = {}
    for col in aligned_data[year].columns:
        year_dict[col] = aligned_data[year][col].tolist()  # Convert all values to lists
    js_data[str(year)] = year_dict

# Get sorted years that have data
valid_years = sorted(aligned_data.keys())

# Create a slider for years
year_slider = Slider(title="Year", start=0, end=len(valid_years)-1, value=0, step=1, width=600)

# Create play/pause button
play_button = Button(label="▶️ Play", button_type="success", width=100)

# JavaScript callback for slider
slider_callback = CustomJS(args=dict(source=source, p=p, years=valid_years, data=js_data), code="""
    // Get the selected year index
    const yearIndex = cb_obj.value;
    const year = years[yearIndex];
    
    // Update data from precomputed results
    const year_data = data[year];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - All Policy Areas Combined (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# Connect callback to slider
year_slider.js_on_change('value', slider_callback)

# Animation callback for play button
animation_callback = CustomJS(args=dict(slider=year_slider, button=play_button), code="""
    if (button.label === "▶️ Play") {
        // Start animation
        button.label = "⏸️ Pause";
        
        // Function to increment slider
        function animate_slider() {
            if (button.label === "⏸️ Pause") {
                let current = slider.value;
                let next = current + 1;
                
                // Loop back to beginning if at the end
                if (next > slider.end) {
                    next = slider.start;
                }
                
                // Update slider value (this will trigger the slider callback)
                slider.value = next;
                
                // Schedule next update
                window.setTimeout(animate_slider, 1000);  // 1 second interval
            }
        }
        
        // Start animation
        animate_slider();
    } else {
        // Pause animation
        button.label = "▶️ Play";
    }
""")

# Connect animation callback to play button
play_button.js_on_click(animation_callback)

# Create layout
layout = column(
    row(year_slider, play_button),
    p
)

# Save the visualization to an HTML file
save(layout)

print("Visualization saved to 'epg_clustering_all_policies_merged.html'. Open this file in a web browser to view the animation.")

Visualization saved to 'epg_clustering_all_policies_merged.html'. Open this file in a web browser to view the animation.



'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=

## Interactive verison with selecte clustering and policy area

In [246]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
from bokeh.plotting import figure, save, output_file
from bokeh.models import (ColumnDataSource, LabelSet, Slider, CustomJS, Button, 
                         Range1d, Select, Legend, LegendItem, HoverTool)
from bokeh.layouts import column, row

# Set output to an HTML file
output_file("plots/epg_clustering_interactive.html")

# Define EPG color scheme
epg_info = {
    'NI': {
        'name': 'Non-Attached Members', 
        'color': '#808080',
        'ideology': 'Mixed/Unaligned', 
        'abbr': 'NI'
    },
    'The Left': {
        'name': 'The Left', 
        'color': '#B71C1C',
        'ideology': 'Left-wing to Far-left', 
        'abbr': 'The Left'
    },
    'S&D': {
        'name': 'Progressive Alliance of Socialists and Democrats', 
        'color': '#D32F2F',
        'ideology': 'Center-left to Left-wing', 
        'abbr': 'S&D'
    },
    'IDG': {
        'name': 'Identity and Democracy Group', 
        'color': '#1A237E',
        'ideology': 'Right-wing to Far-right', 
        'abbr': 'IDG'
    },
    'Greens/EFA': {
        'name': 'Greens/European Free Alliance', 
        'color': '#43A047',
        'ideology': 'Green politics, Regionalist', 
        'abbr': 'Greens/EFA'
    },
    'REG': {
        'name': 'Renew Europe Group', 
        'color': '#039BE5',
        'ideology': 'Centrist, Liberal', 
        'abbr': 'REG'
    },
    'EPP': {
        'name': 'European People\'s Party', 
        'color': '#0D47A1',
        'ideology': 'Center-right to Right-wing', 
        'abbr': 'EPP'
    }
}

# Function to get color for an EPG (using the predefined colors or default to gray)
def get_epg_color(epg):
    if epg in epg_info:
        return epg_info[epg]['color']
    return '#CCCCCC'  # Default gray for unknown EPGs

# Function to get full name for an EPG
def get_epg_full_name(epg):
    if epg in epg_info:
        return epg_info[epg]['name']
    return epg  # Default to the abbreviation if not found

# Function to get ideology for an EPG
def get_epg_ideology(epg):
    if epg in epg_info:
        return epg_info[epg]['ideology']
    return 'Unknown'  # Default to Unknown if not found

# Filter to only include specific policy areas
filtered_policy_areas = [
    'budgetary control', 
    'agriculture', 
    'culture education', 
    'development', 
    'employment social affairs', 
    'environment public health', 
    'fisheries', 
    'gender equality', 
    'international trade', 
    'legal affairs', 
    'regional development'
]

# Get available policy areas that exist in the data
policy_areas = [area for area in filtered_policy_areas if area in similarity_matrices]

# Create a function that will generate aligned coordinates for a given policy area
def generate_coordinates(policy_area):
    # Get years for this policy area
    years = sorted(similarity_matrices[policy_area].keys())
    
    # Get EPGs from the first year
    epgs = list(similarity_matrices[policy_area][years[0]].index)
    
    # Function to perform PCA on a similarity matrix
    def get_coordinates(similarity_matrix):
        # Convert similarity to distance matrix
        distance_matrix = 1 - similarity_matrix
        
        # Convert to numpy array
        X = distance_matrix.values.astype(float)  # Ensure float data type
        
        # Apply PCA
        model = PCA(n_components=2)
        result = model.fit_transform(X)
        
        # Create DataFrame with results - with enhanced metadata
        df_result = pd.DataFrame({
            'x': result[:, 0].astype(float),
            'y': result[:, 1].astype(float),
            'epg': distance_matrix.index.tolist(),
            'color': [get_epg_color(epg) for epg in distance_matrix.index],
            'full_name': [get_epg_full_name(epg) for epg in distance_matrix.index],
            'ideology': [get_epg_ideology(epg) for epg in distance_matrix.index]
        })
        
        return df_result
    
    # Function to align coordinates with reference using Procrustes analysis
    def align_coordinates(target_df, reference_df):
        # Get common EPGs
        common_epgs = set(target_df['epg']).intersection(set(reference_df['epg']))
        
        if len(common_epgs) < 2:
            # Not enough common points to align
            return target_df
        
        # Extract coordinates for common EPGs
        target_coords = np.array([target_df[target_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
        reference_coords = np.array([reference_df[reference_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
        
        # Perform Procrustes analysis to align target to reference
        mtx1, mtx2, disparity = procrustes(reference_coords, target_coords)
        
        # Create transformation matrix (scale, rotation, reflection)
        scale = np.sqrt(np.sum(mtx1[0]**2)) / np.sqrt(np.sum(target_coords[0]**2))
        
        # Apply transformation to all points in target_df
        result_df = target_df.copy()
        coords = result_df[['x', 'y']].values.astype(float)  # Ensure float data type
        
        # Scale and center (simplified Procrustes)
        coords_scaled = coords * scale
        
        # Get centroids
        target_centroid = np.mean(target_coords, axis=0)
        reference_centroid = np.mean(reference_coords, axis=0)
        
        # Translate
        coords_transformed = coords_scaled - target_centroid + reference_centroid
        
        # Update dataframe
        result_df['x'] = coords_transformed[:, 0]
        result_df['y'] = coords_transformed[:, 1]
        
        return result_df
    
    # Compute coordinates for the first year (reference)
    reference_data = get_coordinates(similarity_matrices[policy_area][years[0]])
    
    # Compute and align coordinates for all years
    aligned_data = {}
    for year in years:
        # Compute initial coordinates
        year_data = get_coordinates(similarity_matrices[policy_area][year])
        
        # Align with reference
        if year == years[0]:
            aligned_data[year] = year_data  # Reference year doesn't need alignment
        else:
            aligned_data[year] = align_coordinates(year_data, reference_data)
    
    # Find the overall range of data across all years to set consistent plot boundaries
    all_x = []
    all_y = []
    for year_data in aligned_data.values():
        all_x.extend(year_data['x'].tolist())  # Convert to list
        all_y.extend(year_data['y'].tolist())  # Convert to list
    
    x_min, x_max = min(all_x), max(all_x)
    y_min, y_max = min(all_y), max(all_y)
    
    # Add padding (200% zoom - 50% padding on each side)
    padding_x = (x_max - x_min) * 0.5
    padding_y = (y_max - y_min) * 0.5
    x_range = (float(x_min - padding_x), float(x_max + padding_x))  # Ensure float type
    y_range = (float(y_min - padding_y), float(y_max + padding_y))  # Ensure float type
    
    # Prepare data for JavaScript
    js_data = {}
    for year in years:
        # Convert DataFrame to dictionary safely
        year_dict = {}
        for col in aligned_data[year].columns:
            year_dict[col] = aligned_data[year][col].tolist()  # Convert all values to lists
        js_data[str(year)] = year_dict
    
    return {
        'years': years,
        'data': js_data,
        'init_data': aligned_data[years[0]],
        'x_range': x_range,
        'y_range': y_range
    }

# Generate initial data for the first policy area
initial_policy = policy_areas[0]
initial_data = generate_coordinates(initial_policy)

# Create ColumnDataSource
source = ColumnDataSource(initial_data['init_data'])

# Create the figure with fixed range and hover tool
p = figure(width=800, height=600, 
           title=f'EPG Clustering - {initial_policy} ({initial_data["years"][0]})',
           tools="pan,wheel_zoom,box_zoom,reset,save,hover",
           x_range=Range1d(initial_data['x_range'][0], initial_data['x_range'][1]),
           y_range=Range1d(initial_data['y_range'][0], initial_data['y_range'][1]),
           tooltips=[
               ("Group", "@full_name (@epg)"),
               ("Ideology", "@ideology")
           ])

# Add scatter points
circles = p.circle('x', 'y', source=source, size=15, color='color', alpha=0.8, 
                 line_color='black', line_width=1)

## Add labels
#labels = LabelSet(x='x', y='y', text='epg', source=source,
#                 text_font_size='10pt', text_color='black',
#                 x_offset=10, y_offset=-10)
#p.add_layout(labels)

# Create a legend to explain the colors
legend_items = []
for epg, info in epg_info.items():
    if epg in initial_data['init_data']['epg'].values:
        # Create truly invisible glyphs for the legend by using coordinates outside the plot range
        # This ensures they won't be rendered at all on the actual plot
        invisible_glyph = p.circle(
            x=[float('nan')],  # NaN coordinates won't be plotted
            y=[float('nan')], 
            color=info['color'], 
            size=10
        )
        legend_items.append((f"{epg} - {info['name']}", [invisible_glyph]))

legend = Legend(items=legend_items, location="top_left")
legend.click_policy = "hide"  # Make the legend interactive
p.add_layout(legend)

# Create controls
# Policy area dropdown
policy_select = Select(title="Policy Area:", value=initial_policy, options=policy_areas, width=200)

# Create a slider for years
year_slider = Slider(title="Year", start=0, end=len(initial_data['years'])-1, value=0, step=1, width=400)

# Create play/pause button
play_button = Button(label="▶️ Play", button_type="success", width=100)

# Precompute data for all policy areas
all_data = {}
for policy in policy_areas:
    try:
        print(f"Computing PCA for {policy}...")
        all_data[policy] = generate_coordinates(policy)
    except Exception as e:
        print(f"Error computing PCA for {policy}: {e}")
        # Create empty placeholder
        all_data[policy] = {
            'years': [],
            'data': {},
            'init_data': pd.DataFrame(columns=['x', 'y', 'epg', 'color', 'full_name', 'ideology']),
            'x_range': (-1, 1),
            'y_range': (-1, 1)
        }

# JavaScript callback for policy area selection
policy_callback = CustomJS(args=dict(source=source, p=p, year_slider=year_slider, 
                                  all_data=all_data), code="""
    // Get the selected policy area
    const policy = cb_obj.value;
    
    // Get data for this policy
    const policy_data = all_data[policy];
    const years = policy_data.years;
    
    // Update slider
    year_slider.start = 0;
    year_slider.end = years.length - 1;
    year_slider.value = 0;
    
    // Get first year data
    const year = years[0];
    const year_data = policy_data.data[year];
    
    // Update the plot range
    p.x_range.start = policy_data.x_range[0];
    p.x_range.end = policy_data.x_range[1];
    p.y_range.start = policy_data.y_range[0];
    p.y_range.end = policy_data.y_range[1];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - ${policy} (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# JavaScript callback for slider
slider_callback = CustomJS(args=dict(source=source, p=p, policy_select=policy_select, 
                                 all_data=all_data), code="""
    // Get the selected policy
    const policy = policy_select.value;
    
    // Get data for this policy
    const policy_data = all_data[policy];
    const years = policy_data.years;
    
    // Get the selected year index
    const yearIndex = cb_obj.value;
    const year = years[yearIndex];
    
    // Update data from precomputed results
    const year_data = policy_data.data[year];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - ${policy} (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# Animation callback for play button
animation_callback = CustomJS(args=dict(slider=year_slider, button=play_button), code="""
    if (button.label === "▶️ Play") {
        // Start animation
        button.label = "⏸️ Pause";
        
        // Function to increment slider
        function animate_slider() {
            if (button.label === "⏸️ Pause") {
                let current = slider.value;
                let next = current + 1;
                
                // Loop back to beginning if at the end
                if (next > slider.end) {
                    next = slider.start;
                }
                
                // Update slider value (this will trigger the slider callback)
                slider.value = next;
                
                // Schedule next update
                window.setTimeout(animate_slider, 1000);  // 1 second interval
            }
        }
        
        // Start animation
        animate_slider();
    } else {
        // Pause animation
        button.label = "▶️ Play";
    }
""")

# Connect callbacks
policy_select.js_on_change('value', policy_callback)
year_slider.js_on_change('value', slider_callback)
play_button.js_on_click(animation_callback)

# Create layout
controls = row(
    policy_select,
    year_slider,
    play_button
)

layout = column(
    controls,
    p
)

# Save the visualization to an HTML file
save(layout)

print("Visualization saved to 'epg_clustering_interactive.html'. Open this file in a web browser to interact with the visualization.")


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=

Computing PCA for budgetary control...
Computing PCA for agriculture...
Computing PCA for culture education...
Computing PCA for development...
Computing PCA for employment social affairs...
Computing PCA for environment public health...
Computing PCA for fisheries...
Computing PCA for gender equality...
Computing PCA for international trade...
Computing PCA for legal affairs...
Computing PCA for regional development...
Visualization saved to 'epg_clustering_interactive.html'. Open this file in a web browser to interact with the visualization.


# Prepare more data metrics

In [207]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform
import matplotlib.cm as cm
from matplotlib.gridspec import GridSpec
from bokeh.plotting import figure, save, output_file
from bokeh.models import (ColumnDataSource, HoverTool, CustomJS, Slider, Button, 
                         Range1d, LinearColorMapper, ColorBar, BasicTicker, 
                         MultiSelect, CheckboxGroup, Legend, LegendItem)
from bokeh.layouts import column, row, gridplot
from bokeh.palettes import RdBu11, Spectral11, Category20, viridis
from bokeh.transform import linear_cmap
from scipy.stats import linregress

# Create directory for story plots if it doesn't exist
os.makedirs("plots", exist_ok=True)

# Calculate polarization metrics
def calculate_polarization_metrics(similarity_matrices):
    """Calculate cluster distance as a polarization metric for each year and policy area."""
    metrics = {}
    
    for policy_area in similarity_matrices:
        metrics[policy_area] = {}
        
        for year in similarity_matrices[policy_area]:
            sim_matrix = similarity_matrices[policy_area][year]
            
            # Skip if we don't have enough EPGs
            if len(sim_matrix) < 3:
                continue
                
            # Convert similarity to distance
            distance_matrix = 1 - sim_matrix
            
            # Calculate average distance (higher = more polarized)
            avg_distance = np.mean(distance_matrix.values[~np.eye(len(distance_matrix), dtype=bool)])
            
            # Calculate variance of distances (higher = more uneven polarization)
            var_distance = np.var(distance_matrix.values[~np.eye(len(distance_matrix), dtype=bool)])
            
            # Calculate modularity-like measure (higher = more distinct communities)
            # This uses variance/mean as a simple proxy for community structure
            distances_flat = distance_matrix.values[~np.eye(len(distance_matrix), dtype=bool)].flatten()
            if np.mean(distances_flat) > 0:
                modularity = np.var(distances_flat) / np.mean(distances_flat)
            else:
                modularity = 0
            
            # Calculate cohesion within ideological groups
            # Define ideological groups (adapt based on your data)
            ideological_groups = {
                'Left': ['GUE/NGL', 'LEFT', 'THE LEFT', 'GUE-NGL'], 
                'Greens': ['G/EFA', 'Greens/EFA', 'The Greens', 'GREENS'],
                'Social Democrats': ['S&D', 'SOC', 'PES', 'SD', 'PSE'],
                'Liberals': ['ALDE', 'RENEW', 'RE', 'ALDE/ADLE'],
                'Christian Democrats': ['EPP', 'PPE', 'PPE-DE', 'EPP-ED'],
                'Conservatives': ['ECR', 'UEN'],
                'Right-wing': ['ID', 'EFD', 'ENF', 'EFDD', 'IND/DEM', 'ITS'],
                'Regionalists': ['EFA', 'REG', 'EDA'],
                'Non-affiliated': ['NI']
            }
            
            # Calculate in-group vs out-group distances
            in_group_distances = []
            out_group_distances = []
            
            for i, epg_i in enumerate(distance_matrix.index):
                for j, epg_j in enumerate(distance_matrix.index):
                    if i != j:  # Skip self-comparisons
                        distance = distance_matrix.iloc[i, j]
                        
                        # Check if both EPGs are in the same ideological group
                        same_group = False
                        for group, members in ideological_groups.items():
                            if epg_i in members and epg_j in members:
                                same_group = True
                                break
                        
                        if same_group:
                            in_group_distances.append(distance)
                        else:
                            out_group_distances.append(distance)
            
            # Calculate ideological cohesion (lower = more cohesive within ideological groups)
            if in_group_distances and out_group_distances:
                ideological_cohesion = np.mean(in_group_distances) / np.mean(out_group_distances)
            else:
                ideological_cohesion = np.nan
            
            # Store metrics
            metrics[policy_area][year] = {
                'avg_distance': avg_distance,
                'var_distance': var_distance,
                'modularity': modularity,
                'ideological_cohesion': ideological_cohesion,
                'num_epgs': len(sim_matrix)
            }
    
    return metrics


In [208]:
def create_aggregate_matrix(similarity_matrices):
    """Create an aggregate heatmap of EPG similarities across all years and policy areas."""
    
    # Get all unique EPGs
    all_epgs = set()
    for policy_area in similarity_matrices:
        for year in similarity_matrices[policy_area]:
            matrix = similarity_matrices[policy_area][year]
            all_epgs.update(matrix.index)
    
    all_epgs = sorted(list(all_epgs))
    
    # Create an aggregate similarity matrix
    aggregate_matrix = pd.DataFrame(0.0, index=all_epgs, columns=all_epgs)
    count_matrix = pd.DataFrame(0, index=all_epgs, columns=all_epgs)
    
    # Sum all similarity matrices
    for policy_area in similarity_matrices:
        for year in similarity_matrices[policy_area]:
            matrix = similarity_matrices[policy_area][year]
            for i in matrix.index:
                for j in matrix.columns:
                    aggregate_matrix.loc[i, j] += matrix.loc[i, j]
                    count_matrix.loc[i, j] += 1
    
    # Average the similarities
    for i in aggregate_matrix.index:
        for j in aggregate_matrix.columns:
            if count_matrix.loc[i, j] > 0:
                aggregate_matrix.loc[i, j] /= count_matrix.loc[i, j]
            else:
                # If no data, set to NaN (will be shown as white/missing in heatmap)
                aggregate_matrix.loc[i, j] = np.nan
    
    # Set diagonal to 1 (self-similarity)
    for i in aggregate_matrix.index:
        aggregate_matrix.loc[i, i] = 1.0
    
    # Create the heatmap figure
    plt.figure(figsize=(12, 10))
    
    # Calculate the center point dynamically to enhance color contrast
    sim_values = aggregate_matrix.values.flatten()
    sim_values = sim_values[~np.isnan(sim_values)]
    median_sim = np.median(sim_values)
    
    # Use a diverging colormap with enhanced contrast
    cmap = sns.diverging_palette(220, 20, as_cmap=True)  # Blue to red with more contrast
    
    # Create the heatmap with enhanced dynamic range
    sns.heatmap(aggregate_matrix, cmap=cmap, center=median_sim,
                square=True, linewidths=.5, cbar_kws={"shrink": .8},
                vmin=max(0, median_sim - 0.3), vmax=min(1, median_sim + 0.3))  # Adjust range for more color variation
    
    plt.title('Overview: Average EPG Voting Similarity Across All Years and Policy Areas', fontsize=14)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    
    # Save the figure
    plt.tight_layout()
    plt.savefig('plots/1_overview_heatmap.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    return aggregate_matrix

In [209]:
polarization_metrics = calculate_polarization_metrics(similarity_matrices)
aggregate_matrix = create_aggregate_matrix(similarity_matrices)

# Enhanced PCA Visualization with Correct Political Spectrum


In [210]:

# ----------------------
# Enhanced PCA Visualization with Correct Political Spectrum
# ----------------------

def create_pca_visualization(aggregate_matrix):
    """Create a PCA visualization of EPG positions based on aggregate similarity."""
    
    # Clean the matrix (replace NaN with 0)
    clean_matrix = aggregate_matrix.fillna(0)
    
    # Convert similarity to distance
    distance_matrix = 1 - clean_matrix
    
    # Apply PCA
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(distance_matrix)
    
    # Create a DataFrame for the results
    pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'], index=distance_matrix.index)
    
    # Organize EPGs into known political groups (comprehensive mapping)
    political_spectrum = {
        'Left': ['GUE/NGL', 'LEFT', 'THE LEFT', 'GUE-NGL'],
        'Greens': ['G/EFA', 'Greens/EFA', 'The Greens', 'GREENS'],
        'Social Democrats': ['S&D', 'SOC', 'PES', 'SD', 'PSE'],
        'Liberals': ['ALDE', 'RENEW', 'RE', 'ALDE/ADLE'],
        'Christian Democrats': ['EPP', 'PPE', 'PPE-DE', 'EPP-ED'],
        'Conservatives': ['ECR', 'UEN'],
        'Right-wing': ['ID', 'EFD', 'ENF', 'EFDD', 'IND/DEM', 'ITS'],
        'Regionalists': ['EFA', 'REG', 'EDA'],
        'Non-affiliated': ['NI']
    }
    
    # Define colors for the political spectrum (enhanced palette)
    colors = {
        'Left': '#d62728',  # Dark red
        'Greens': '#2ca02c',  # Green
        'Social Democrats': '#ff7f0e',  # Orange
        'Liberals': '#ffff00',  # Yellow
        'Christian Democrats': '#1f77b4',  # Blue
        'Conservatives': '#9467bd',  # Purple
        'Right-wing': '#8c564b',  # Brown
        'Regionalists': '#e377c2',  # Pink
        'Non-affiliated': '#7f7f7f'   # Grey
    }
    
    # Create the PCA visualization
    plt.figure(figsize=(12, 10))
    
    # Create a dictionary to store points for legend (to avoid duplicates)
    legend_handles = {}
    
    # Check each EPG and assign to a political group
    epg_groups = {}
    for epg in pca_df.index:
        assigned = False
        for group, members in political_spectrum.items():
            if any(member.lower() in epg.lower() or epg.lower() in member.lower() for member in members):
                epg_groups[epg] = group
                assigned = True
                break
        if not assigned:
            # Try a more flexible matching for any remaining groups
            for group, members in political_spectrum.items():
                if any(member.split('/')[0] in epg for member in members if '/' in member):
                    epg_groups[epg] = group
                    assigned = True
                    break
        if not assigned:
            # Last resort: check for common abbreviations
            if 'GUE' in epg or 'LEFT' in epg.upper():
                epg_groups[epg] = 'Left'
            elif 'GREEN' in epg.upper() or 'G/E' in epg:
                epg_groups[epg] = 'Greens'
            elif 'S&D' in epg or 'SOC' in epg.upper() or 'SD' in epg.upper():
                epg_groups[epg] = 'Social Democrats'
            elif 'ALDE' in epg or 'LIB' in epg.upper() or 'RENEW' in epg.upper():
                epg_groups[epg] = 'Liberals'
            elif 'EPP' in epg or 'PPE' in epg:
                epg_groups[epg] = 'Christian Democrats'
            elif 'ECR' in epg:
                epg_groups[epg] = 'Conservatives'
            elif 'EFD' in epg or 'ID' in epg or 'ENF' in epg:
                epg_groups[epg] = 'Right-wing'
            elif 'REG' in epg.upper() or 'EFA' in epg:
                epg_groups[epg] = 'Regionalists'
            elif 'NI' in epg:
                epg_groups[epg] = 'Non-affiliated'
            else:
                # If still can't determine, just print a warning and set to 'Other'
                print(f"Warning: Could not assign {epg} to a political group")
                epg_groups[epg] = 'Other'
    
    # Plot each EPG
    for epg in pca_df.index:
        group = epg_groups.get(epg, 'Other')
        color = colors.get(group, 'gray')
        
        # Plot the point
        plt.scatter(pca_df.loc[epg, 'PC1'], pca_df.loc[epg, 'PC2'], 
                   color=color, s=120, alpha=0.8, edgecolors='black')
        
        # Add text label for the EPG
        plt.text(
            pca_df.loc[epg, 'PC1'] + 0.01,    # right (east)
            pca_df.loc[epg, 'PC2'] - 0.01,    # down (south)
            epg,
            fontsize=12,
            weight='bold'
        )

        
        # Add to legend handles
        if group not in legend_handles:
            legend_handles[group] = plt.Line2D([0], [0], marker='o', color='w', 
                                             markerfacecolor=color, markersize=12, 
                                             label=group)
    
    # Add a legend with political groups
    plt.legend(handles=list(legend_handles.values()), 
              title='Political Groups', fontsize=12, 
              title_fontsize=14, loc='best')
    
    # Add labels and title
    explained_var = pca.explained_variance_ratio_
    plt.xlabel(f'Principal Component 1 ({explained_var[0]:.2%} variance)', fontsize=14)
    plt.ylabel(f'Principal Component 2 ({explained_var[1]:.2%} variance)', fontsize=14)
    plt.title('PCA: European Parliament Groups Positioned by Voting Patterns', fontsize=16)
    
    # Add grid lines and enhance visualization
    plt.grid(True, alpha=0.3, linestyle='--')
    
    # Add horizontal and vertical lines at origin for reference
    plt.axhline(y=0, color='k', linestyle='-', alpha=0.3)
    plt.axvline(x=0, color='k', linestyle='-', alpha=0.3)
    
    # Save the figure
    plt.tight_layout()
    plt.savefig('plots/pca_visualization.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    return pca_df


In [211]:
pca_df = create_pca_visualization(aggregate_matrix)

# Interactive Time Series with Multiple Selection

In [212]:

# ----------------------
#  Interactive Time Seres with Multiple Selection
# ----------------------

def create_interactive_polarization_time_series(similarity_matrices, polarization_metrics):
    """Create interactive time series plots of polarization metrics with Bokeh."""
    
    # Create a time series of metrics for each policy area
    time_series_data = {}
    
    for policy_area in polarization_metrics:
        years = sorted(polarization_metrics[policy_area].keys())
        
        avg_distances = [polarization_metrics[policy_area][year]['avg_distance'] 
                        if year in polarization_metrics[policy_area] else None 
                        for year in years]
        
        var_distances = [polarization_metrics[policy_area][year]['var_distance'] 
                        if year in polarization_metrics[policy_area] else None 
                        for year in years]
        
        modularity = [polarization_metrics[policy_area][year]['modularity'] 
                     if year in polarization_metrics[policy_area] else None 
                     for year in years]
        
        if 'ideological_cohesion' in next(iter(polarization_metrics[policy_area].values()), {}):
            cohesion = [polarization_metrics[policy_area][year]['ideological_cohesion'] 
                      if year in polarization_metrics[policy_area] and not np.isnan(polarization_metrics[policy_area][year]['ideological_cohesion']) 
                      else None 
                      for year in years]
        else:
            cohesion = [None] * len(years)
        
        num_epgs = [polarization_metrics[policy_area][year]['num_epgs'] 
                   if year in polarization_metrics[policy_area] else None 
                   for year in years]
        
        time_series_data[policy_area] = {
            'years': years,
            'avg_distance': avg_distances,
            'var_distance': var_distances,
            'modularity': modularity,
            'cohesion': cohesion,
            'num_epgs': num_epgs
        }
    
    # Filter out policy areas with too little data
    filtered_policy_areas = {}
    for policy_area, data in time_series_data.items():
        valid_years = [i for i, d in enumerate(data['avg_distance']) if d is not None]
        if len(valid_years) >= 5:  # At least 5 years of data
            filtered_policy_areas[policy_area] = {
                'years': [data['years'][i] for i in valid_years],
                'avg_distance': [data['avg_distance'][i] for i in valid_years],
                'var_distance': [data['var_distance'][i] for i in valid_years],
                'modularity': [data['modularity'][i] for i in valid_years] if any(data['modularity']) else None,
                'cohesion': [data['cohesion'][i] for i in valid_years] if any(data['cohesion']) else None,
                'num_epgs': [data['num_epgs'][i] for i in valid_years]
            }
    
    # Calculate average polarization across all policy areas by year
    all_years = set()
    for policy_area in polarization_metrics:
        all_years.update(polarization_metrics[policy_area].keys())
    all_years = sorted(all_years)
    
    avg_polarization_by_year = {}
    for year in all_years:
        values = []
        for policy_area in polarization_metrics:
            if year in polarization_metrics[policy_area]:
                values.append(polarization_metrics[policy_area][year]['avg_distance'])
        
        if values:
            avg_polarization_by_year[year] = np.mean(values)
    
    # Create an interactive Bokeh plot
    output_file('plots/interactive_polarization.html')
    
    # Setup initial data source for the average across all policy areas
    avg_source = ColumnDataSource(data=dict(
        years=list(avg_polarization_by_year.keys()),
        values=list(avg_polarization_by_year.values()),
        policy=['Average Across All Areas'] * len(avg_polarization_by_year)
    ))
    
    # Create the figure
    p = figure(width=900, height=600, 
              title='Polarization Trends Over Time by Policy Area',
              x_axis_label='Year', y_axis_label='Polarization (Average Distance)',
              tools="pan,wheel_zoom,box_zoom,reset,save,hover")
    
    # Configure hover tool
    hover = p.select(dict(type=HoverTool))
    hover.tooltips = [
        ("Policy Area", "@policy"),
        ("Year", "@years"),
        ("Polarization", "@values{0.000}")
    ]
    
    # Plot the average line
    avg_line = p.line('years', 'values', source=avg_source, line_width=3, 
                     color='black', alpha=0.8, legend_label="Average Across All Areas")
    avg_circle = p.circle('years', 'values', source=avg_source, size=8, 
                         color='black', alpha=0.8)
    
    # Create multi select widget for policy areas
    sorted_areas = sorted(filtered_policy_areas.keys())
    
    # Select top 5 most varying policy areas as default selected
    policy_variance = []
    for policy in sorted_areas:
        data = filtered_policy_areas[policy]
        if len(data['avg_distance']) > 0:
            policy_variance.append((policy, np.var(data['avg_distance'])))
    
    policy_variance.sort(key=lambda x: x[1], reverse=True)
    default_selected = [p[0] for p in policy_variance[:5]]
    
    # Create a mapping of colors for each policy area
    color_palette = Category20[20]
    policy_colors = {policy: color_palette[i % 20] for i, policy in enumerate(sorted_areas)}
    
    # Create the multi-select widget
    policy_select = MultiSelect(title="Select Policy Areas to Compare:",
                              options=sorted_areas,
                              value=default_selected,
                              height=300,
                              width=300)
    
    # Create data sources for each policy area (initially empty)
    policy_sources = {}
    policy_lines = {}
    policy_circles = {}
    
    for policy in sorted_areas:
        data = filtered_policy_areas[policy]
        
        # Create source
        policy_sources[policy] = ColumnDataSource(data=dict(
            years=data['years'],
            values=data['avg_distance'],
            policy=[policy] * len(data['years']),
            visible=[policy in default_selected] * len(data['years'])
        ))
        
        # Create line and circle, set initial visibility
        visible = policy in default_selected
        
        policy_lines[policy] = p.line('years', 'values', source=policy_sources[policy],
                                    line_width=2, color=policy_colors[policy],
                                    alpha=0.8 if visible else 0,
                                    legend_label=policy if visible else "")
        
        policy_circles[policy] = p.circle('years', 'values', source=policy_sources[policy],
                                        size=6, color=policy_colors[policy],
                                        alpha=0.8 if visible else 0)
    
    # Add legend
    p.legend.location = "top_left"
    p.legend.click_policy = "hide"
    
    # Create a callback for MultiSelect widget
    callback = CustomJS(args=dict(
        policy_sources=policy_sources,
        policy_lines=policy_lines,
        policy_circles=policy_circles,
        policy_colors=policy_colors,
        p=p), code="""
        // Get selected policies
        const selected_policies = cb_obj.value;
        
        // Update each policy line and circle visibility
        for (const policy in policy_sources) {
            const is_selected = selected_policies.includes(policy);
            const alpha = is_selected ? 0.8 : 0;
            
            // Update alpha for lines and circles
            policy_lines[policy].glyph.line_alpha = alpha;
            policy_circles[policy].glyph.fill_alpha = alpha;
            policy_circles[policy].glyph.line_alpha = alpha;
            
            // Update legend (this doesn't work perfectly in Bokeh callbacks)
            if (is_selected) {
                policy_lines[policy].legend_label = policy;
            } else {
                policy_lines[policy].legend_label = "";
            }
        }
        
        // This will trigger a redraw
        for (const policy in policy_sources) {
            policy_sources[policy].change.emit();
        }
    """)
    
    # Attach callback to widget
    policy_select.js_on_change('value', callback)
    
    # Layout
    layout = row(
        policy_select,
        p
    )
    
    # Save to file
    save(layout)
    
    return filtered_policy_areas, avg_polarization_by_year


In [213]:
df[df['year'] == 2022]['month'].value_counts()

month
2    406785
3    370830
5    317955
6    275264
1    161216
4     92355
Name: count, dtype: int64

In [214]:
filtered_policy_areas, avg_polarization_by_year = create_interactive_polarization_time_series(
        similarity_matrices, polarization_metrics
    )


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=


#  Policy Area Ranking (Keep Same)

In [215]:

# ----------------------
# Policy Area Ranking (Keep Same)
# ----------------------

def create_policy_area_ranking(polarization_metrics, avg_polarization_by_year):
    """Create a visualization of policy areas ranked by polarization over time."""
    
    # Get all years and policy areas
    all_years = sorted(set(y for pa in polarization_metrics.values() for y in pa.keys()))
    policy_areas = list(polarization_metrics.keys())
    
    # Create a matrix of polarization values [policy_areas x years]
    polarization_matrix = pd.DataFrame(index=policy_areas, columns=all_years)
    
    for policy_area in policy_areas:
        for year in all_years:
            if year in polarization_metrics[policy_area]:
                polarization_matrix.loc[policy_area, year] = polarization_metrics[policy_area][year]['avg_distance']
    
    # Calculate overall polarization for each policy area (average over years)
    avg_polarization = polarization_matrix.mean(axis=1).dropna()
    
    # Sort policy areas by average polarization
    sorted_areas = avg_polarization.sort_values(ascending=False)
    
    # Take top and bottom 10 policy areas
    top_areas = sorted_areas.head(10)
    bottom_areas = sorted_areas.tail(10)
    
    # Create a figure with two subplots
    fig, axes = plt.subplots(2, 1, figsize=(12, 14), sharex=True)
    
    # Plot top 10 most polarized areas
    top_areas.plot(kind='barh', ax=axes[0], color='tomato')
    axes[0].set_title('Top 10 Most Polarized Policy Areas', fontsize=14)
    axes[0].set_xlabel('Average Polarization', fontsize=12)
    axes[0].grid(True, alpha=0.3)
    
    # Plot bottom 10 least polarized areas
    bottom_areas.plot(kind='barh', ax=axes[1], color='skyblue')
    axes[1].set_title('Top 10 Least Polarized Policy Areas', fontsize=14)
    axes[1].set_xlabel('Average Polarization', fontsize=12)
    axes[1].grid(True, alpha=0.3)
    
    # Adjust layout
    plt.tight_layout()
    plt.savefig('plots/policy_area_ranking.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    return sorted_areas, polarization_matrix


In [216]:
sorted_areas, polarization_matrix = create_policy_area_ranking(
        polarization_metrics, avg_polarization_by_year
    )


# Polarization Trends Visualization (Fixed)

In [217]:

# ----------------------
# Polarization Trends Visualization (Fixed)
# ----------------------

def create_polarization_patterns_visualization(polarization_metrics):
    """Create visualizations of polarization patterns showing clear trends over time."""
    
    # Extract trends in polarization over time for each policy area
    policy_trends = {}
    
    for policy in polarization_metrics:
        years = sorted(polarization_metrics[policy].keys())
        if len(years) >= 5:  # Only consider policies with enough data points
            values = [polarization_metrics[policy][year]['avg_distance'] for year in years]
            
            # Calculate linear trend
            if len(years) > 1:
                try:
                    slope, intercept, r_value, p_value, std_err = linregress(years, values)
                    policy_trends[policy] = {
                        'slope': slope,
                        'r_value': r_value,
                        'p_value': p_value,
                        'years': years,
                        'values': values
                    }
                except Exception as e:
                    print(f"Error calculating trend for {policy}: {e}")
    
    # Sort policies by trend strength (descending by absolute slope)
    sorted_policies = sorted(policy_trends.keys(), 
                            key=lambda p: abs(policy_trends[p]['slope']), 
                            reverse=True)
    
    # Select top 5 increasing and top 5 decreasing trends
    increasing = [p for p in sorted_policies if policy_trends[p]['slope'] > 0][:5]
    decreasing = [p for p in sorted_policies if policy_trends[p]['slope'] < 0][:5]
    
    # Create a figure showing diverging trends
    plt.figure(figsize=(12, 8))
    
    # Plot increasing trends
    plt.subplot(2, 1, 1)
    for i, policy in enumerate(increasing):
        data = policy_trends[policy]
        plt.plot(data['years'], data['values'], 'o-', 
                label=f"{policy} (r={data['r_value']:.2f})")
    
    plt.title('Policy Areas with Increasing Polarization', fontsize=14)
    plt.ylabel('Polarization', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.legend(fontsize=10)
    
    # Plot decreasing trends
    plt.subplot(2, 1, 2)
    for i, policy in enumerate(decreasing):
        data = policy_trends[policy]
        plt.plot(data['years'], data['values'], 'o-', 
                label=f"{policy} (r={data['r_value']:.2f})")
    
    plt.title('Policy Areas with Decreasing Polarization', fontsize=14)
    plt.xlabel('Year', fontsize=12)
    plt.ylabel('Polarization', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.legend(fontsize=10)
    
    plt.tight_layout()
    plt.savefig('plots/polarization_trends.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    return policy_trends

In [218]:
policy_trends = create_polarization_patterns_visualization(polarization_metrics)

In [219]:


# Calculate polarization metrics
polarization_metrics = calculate_polarization_metrics(similarity_matrices)

# Part 1: Overview Heatmap with Enhanced Colors

# Part 2: Enhanced PCA Visualization with Correct Political Spectrum
print("Creating enhanced PCA visualization...")
pca_df = create_pca_visualization(aggregate_matrix)

# Part 3: Interactive Time Series with Multiple Selection
print("Creating interactive polarization time series...")
filtered_policy_areas, avg_polarization_by_year = create_interactive_polarization_time_series(
    similarity_matrices, polarization_metrics
)

# Part 4: Policy Area Ranking (Keep Same)
print("Creating policy area ranking...")
sorted_areas, polarization_matrix = create_policy_area_ranking(
    polarization_metrics, avg_polarization_by_year
)

# Part 5: Polarization Trends Visualization (Fixed)
print("Creating polarization trends visualization...")
policy_trends = create_polarization_patterns_visualization(polarization_metrics)


Creating enhanced PCA visualization...
Creating interactive polarization time series...



'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=

Creating policy area ranking...
Creating polarization trends visualization...
