
---

# Motivation

### What is your dataset?

The analysis is done on a comprehensive dataset of European Parliament voting records. This dataset captures how each Member of the European Parliament voted on proposed legislation, along with detailed information about the legislation itself, the MEPs, and their party affiliations. The dataset includes two main file types:

* **RCV (Roll Call Votes):** Contains individual vote records for each MEP, including their country, national party, and European Parliamentary Group (EPG).
* **Voted Docs:** Provides metadata on the legislation being voted on, including the vote date, outcome, and associated policy area.

The data is divided across four parliamentary sessions:

* EP6 (2004–2009)
* EP7 (2009–2014)
* EP8 (2014–2019)
* EP9 (2019–2022)

To enable longitudinal analysis, we merged data from all four sessions. This required extensive preprocessing to reconcile differences in schema and formatting across sessions.

### Why did you choose this particular dataset?

We chose this dataset because it offers a rich foundation for analyzing trends in European politics over time. It enables us to explore whether broader political shifts—such as increasing polarization or the rise of right-leaning ideologies—are observable in parliamentary voting behavior.

### What was your goal for the end user's experience?

Our goal is to provide users with an accessible interface to explore aggregated voting patterns and to present clear, data-driven insights into evolving political dynamics within the European Parliament. We aim to make it easy for users to identify trends related to polarization, such as which parties or policy areas are becoming more divisive, and which remain broadly supported. Ultimately, we want to support informed reflection on how shifts in ideology are shaping legislative decision-making at the EU level.

---

# Basic Stats

### Preprocessing

To analyze voting trends in the European Parliament across four sessions (EP6–EP9), we consolidated roll-call vote data (`RCV`) with metadata on each legislative item (`Voted Docs`). Due to inconsistencies across sessions in schema, date formats, and naming conventions, several preprocessing steps were necessary:

* Merge EP sessions: Merged data from the 4 sessions. Were some inconsistencies in the column names so we standardized them.
* Text Normalization: Policy area fields and party/EPG names were cleaned to remove noise (e.g., punctuation, inconsistent capitalization).
* Date Parsing: Dates appeared in both `dd.mm.yyyy` and `yyyy-mm-dd` formats. A custom parser ensured accurate conversion to a uniform `datetime` format.
* Schema Harmonization: Since column names and vote identifiers varied by session (e.g., `euro_act_id` in EP6 vs `Vote ID` in later sessions), we created session-aware mappings to unify data.
* Party Group Mapping: EP political group (`EPG`) names were mapped to their common abbreviations (e.g., “Group of the European People’s Party...” → `EPP`) to allow for consistent comparison.
* Removed incomplete years: Removed years 2004 and 2022 since they are not finished and might skew yearly results. 
* Name Standardization: MEP names were cleaned to ensure consistency across sessions, accounting for formatting, punctuation, and Unicode differences.


These steps allowed us to generate a unified dataset spanning 2005–2021 with over 27 million individual votes


### Dataset Statistics

📅 **Time Coverage:** The dataset spans years 2005 to 2021

🧑‍🤝‍🧑 **MEPs Involved:** \~2.945 unique MEPs from all 27 EU member states over the 4 EP sessions.

📄 **Votes Recorded:**

* \~27,000 roll-call votes
* \~4.8 million MEP-level voting entries (rows)

🗳️ **Vote Distribution (Sample):**

  | Vote      | Meaning                    | % of Total     |
  | --------- | -------------------------- | -------------- |
  | 1         | For                        | \~42.7%          |
  | 2         | Against                    | \~27.2%          |
  | 3         | Abstain                    | \~3.6%          |
  | 4 / 5 / 6 | Absent / No vote / Excused | \~26.5% combined |


🏛️ **Top Policy Areas:**
A subset of policy areas was selected to capture the most relevant votes:

* Budgetary Control
* Agriculture
* Culture & Education
* Development
* Employment & Social Affairs
* Environment & Public Health
* Fisheries
* Gender Equality
* International Trade
* Legal Affairs
* Regional Development


🧭 **EP Group Participation:**
  The analysis focused on parties that were present across all years. These were also among the largest parties, helping to reduce noise from smaller or less consistent groups:

* NI
* The Left
* S\&D
* IDG
* Greens/EFA
* REG
* EPP


# Data Analysis


Our analysis of European Parliament voting patterns (2005-2022) reveals several key insights:

### Methodology
- **Rice Index** ($RI = \frac{|Y_i - N_i|}{Y_i + N_i + A_i}$) measured internal group cohesion
- **Similarity Index** ($Sim_{1,2} = 1 - \frac{1}{2}(|\%Y_1 - \%Y_2| + |\%N_1 - \%N_2| + |\%A_1 - \%A_2|)$) quantified cross-group alignment
- **PCA** visualized voting relationships and revealed clustering patterns

### Key Findings

1. **Crisis-Driven Unity:** Major crises temporarily increase parliamentary cohesion
   - Financial Crisis (2008): Rice index jumped from 0.5 to 0.63
   - COVID-19 (2020): Cohesion rose to 0.54 during recovery fund votes
   - Post-crisis periods show consistent fragmentation (index fell to 0.47 by 2019)

2. **Group Cohesion Patterns:**
   - Mainstream groups (Greens/EFA, S&D, EPP) maintain high cohesion (Greens/EFA most stable: 0.04)
   - Populist/non-attached groups show persistent fragmentation (Rice values: 0.31-0.58)
   - Each group shows stronger unity on their core ideological issues

3. **Policy-Area Insights:**
   - **Legal Affairs:** Highest volatility across all policy domains
   - **Employment & Environment:** Consistently below-average cohesion, reflecting ideological splits
   - **International Trade:** Growing consensus since 2016 (particularly in 2022)
   - **Budgetary Control:** Sharp realignment during 2012 Eurozone crisis

4. **Cross-Group Relationships:**
   - Stable partnerships: Greens/EFA–The Left and S&D–Renew Europe
   - Persistent opposition: IDG typically distant from other groups
   - COVID-era divergence: Groups spread apart more dramatically after 2020

5. **PCA Visualization:**
   - Two components explain 78.3% of voting variance
   - First component (47.1%) follows traditional left-right spectrum
   - Crisis periods reduce inter-group distances by average 23.7%

Our analysis demonstrates that while European Parliament voting follows clear ideological patterns, external crises create temporary unity followed by increased fragmentation once emergencies subside.

# Genre

Magazine style was the genre used. This works well for our format of storytelling trough a website, and works great when you want to include more text for the contex compared to the other genres.


# Visual Narrative Tools Used

## Visual Structuring
* Consistent Visual Platform: We use same color theme and the same colors to represent the EPGs in all visualizations. Also the names of EPGs and plocy areas remain the same.
* Progress Bar: We have a top navbar that follows the different sections in the website.

## Highlighting
* Close-Ups: In the PCA clustering, we look into what actauly were the mostly related and least related in a table. This really shows the most distinct patterns which is hard to obersve over the moving data.
* Feature Distinction: We use distinct colors for the EPGs, which we keep consistent over all visalizations.
* Motion: We show a animation of the PCA analysis over time.
* Zooming: We have support for zooming in visualizatoin of the PCA for example, but this is not a key feature.

## Transition Guidance
* Familiar Objects: We use this not in the visualizations, but troughout the story by using Emojis.
* Object Continuity: Same colors for EPGs.

# Visualizations

A mix of interactive line charts, bar charts, and animated clustering plots was used, as each visualization plays a distinct role in the narrative:

1. **Time-series line charts** (e.g., Parliament-wide agreement, within-party cohesion) track longitudinal trends in internal unity and division, revealing how crises—such as the Eurozone debt crisis or the COVID-19 pandemic—produce spikes in agreement.

2. **Policy area comparisons** illustrate how polarization varies by topic. Some areas, like legal affairs, fluctuate significantly, while others, such as environment and public health, remain persistently divisive. This adds contextual nuance to the overall polarization trend.

3. **Interactive bar charts** provide drill-down views of individual party behavior across topics, allowing users to explore how ideological agendas shape cohesion on specific issues.

4. **Clustering plots**, based on custom similarity indices, visualize inter-group alignment. These go beyond simple metrics to reveal how close or distant different parties are in their voting behavior. The spatial arrangements offer an intuitive way to communicate abstract political distances.

5. **Tables** are used to complement the clustering plots by providing a clear overview of the most extreme differences or similarities—insights that could otherwise be easy to overlook.


# Discussion. 

Think critically about your creation
What went well?,
What is still missing? What could be improved?, Why?


Overall the project, we were able to find a interesting subject and dataset, and were able to use booth widely used methods like PCA ,but also more complicated statistical methods like PCA to the data. We created a very asethelcially please website, and use internal links to expland on the more 

# Data preprocessing and cleaning

In [193]:
import pandas as pd

def clean_name(first_name, last_name):
    import unicodedata
    
    if not isinstance(first_name, str):
        first_name = str(first_name) if first_name is not None else ""
    if not isinstance(last_name, str):
        last_name = str(last_name) if last_name is not None else ""
    
    first_name = first_name.lower().strip()
    last_name = last_name.lower().strip()
    
    def normalize_chars(text):
        text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
        return text
    
    first_name = normalize_chars(first_name)
    last_name = normalize_chars(last_name)
    
    for char in ['-', "'", "`", ".", ",", "&", "'"]: 
        first_name = first_name.replace(char, ' ')
        last_name = last_name.replace(char, ' ')
    
    while '  ' in first_name:
        first_name = first_name.replace('  ', ' ')
    while '  ' in last_name:
        last_name = last_name.replace('  ', ' ')
        
    first_name = ' '.join(word.capitalize() for word in first_name.split())
    last_name = ' '.join(word.capitalize() for word in last_name.split())
    
    full_name = f"{first_name} {last_name}".strip()
    
    return full_name

def clean_text(text):
    if not isinstance(text, str):
        return text
  
    text = text.lower()
    
    for char in ['&', ',', '-']:
        text = text.replace(char, ' ')
    
    text = text.replace(' and ', ' ')
    
    while '  ' in text:
        text = text.replace('  ', ' ')
    
    return text.strip()    

def process_ep_voting_data(rcv_files, voted_docs_files):

    if len(rcv_files) != len(voted_docs_files):
        raise ValueError("The lists of RCV files and Voted docs files must have the same length")
     
    all_data = []
    
    for i, (rcv_file, voted_doc_file) in enumerate(zip(rcv_files, voted_docs_files)):
        print(f"Processing files {i+1}/{len(rcv_files)}: {rcv_file} and {voted_doc_file}")
        
        if "EP6" in rcv_file:
            ep_session = "EP6"
            vote_start_index = 10
            rcv_data = pd.read_excel(rcv_file, header=1)
        elif "EP7" in rcv_file:
            ep_session = "EP7"
            vote_start_index = 9
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        elif "EP8" in rcv_file:
            ep_session = "EP8"
            vote_start_index = 9
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        elif "EP9" in rcv_file:
            ep_session = "EP9"
            vote_start_index = 10
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        else:
            ep_session = "Unknown"
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
            print("UNKNOWN SESSION")

        rcv_data = rcv_data.dropna(how='all')
        
        voted_docs = pd.read_excel(voted_doc_file)


        vote_columns = rcv_data.columns[vote_start_index:].tolist()
       
        votes_df = process_votes_ep(rcv_data, voted_docs, vote_columns, ep_session=ep_session)


        votes_df['ep_session'] = ep_session
        
        all_data.append(votes_df)
    
    combined_df = pd.concat(all_data, ignore_index=True)
    
    return combined_df


def process_votes_ep(rcv_data, voted_docs, vote_columns, ep_session = None):
    """Process voting data for EP7, EP8, EP9 sessions"""

    total_skipped = 0

    if ep_session == 'EP6':
        date = 'date'
        title = 'title'
        policy_area = 'main_policy_name'
        vote_id_key = 'euro_act_id'
        author = 'author_name'

        mep_id_key = 'WebisteEpID'

    else:
        date = 'Date'
        title = 'Title'
        policy_area = 'De'
        vote_id_key = 'Vote ID'
        author = 'Author'

        mep_id_key = 'WebisteEpID'

        if ep_session == 'EP7':
            mep_id_key = 'MEP ID'

        if ep_session == 'EP8':
            policy_area = "De/Policy area"

        elif ep_session == 'EP9':
            policy_area = 'Policy area'
  
    vote_info = {}
    for _, row in voted_docs.iterrows():

        vote_info[str(row[vote_id_key])] = {
            'date': row[date],
            'title': row[title],
            'policy_area': row[policy_area],
            'author': author,
        }
    
    results = []
    
    for _, mep_row in rcv_data.iterrows():
        country = mep_row['Country']
        party = mep_row['Party']
        epg = mep_row['EPG']

        first_name = mep_row['Fname']
        last_name = mep_row['Lname']
        
        mep_id = mep_row[mep_id_key]
    
        # Process each vote for this MEP
        for vote_col in vote_columns:
            
            vote_col = str(vote_col)
            vote_code = f'{ep_session}-{vote_col}' 
            
            if vote_col not in vote_info:
                total_skipped += 1
                continue
            
            try:
                mep_vote = mep_row[str(vote_col)]
            except Exception as e:
                mep_vote = mep_row[int(vote_col)]
            
            if mep_vote == 0:
                continue
                
            info = vote_info[vote_col]
            
            results.append({
                'full name': clean_name(first_name, last_name),
                'country': country,
                'national_party': party,
                'epg': epg,
                'mep_id': mep_id,
                'vote_code': vote_code,
                'vote': mep_vote,
                'date': info['date'],
                'title': info['title'],
                'policy_area': clean_text(info['policy_area']),
            })
    
    print(f"Were not able to match: {total_skipped} votes")
    return pd.DataFrame(results)

In [20]:
import pandas as pd
import numpy as np

def parse_mixed_dates(date_str):
    
    if not isinstance(date_str, str):
        date_str = str(date_str)
    
    date_str = date_str.strip()
    
    try:
        # Check for dates with time component
        if " 00:00:00" in date_str:
            date_str = date_str.replace(" 00:00:00", "")

        if date_str == '18 ian 2007':
            return pd.to_datetime('2007-01-18')
        
        if '.' in date_str:
            # Parse as dd.mm.yyyy
            return pd.to_datetime(date_str, format='%d.%m.%Y')
        elif '-' in date_str:
            # Parse as yyyy-mm-dd
            return pd.to_datetime(date_str, format='%Y-%m-%d')
        elif '/' in date_str:
            try:
                # First try %d/%m/%Y (day/month/4-digit year)
                return pd.to_datetime(date_str, format='%d/%m/%Y')
            except ValueError:
                try:
                    # Then try %d/%m/%y (day/month/2-digit year)
                    return pd.to_datetime(date_str, format='%d/%m/%y')
                except ValueError:
                    # Fall back to pandas default parser with dayfirst=True
                    return pd.to_datetime(date_str, dayfirst=True)
        else:
            # For formats we haven't explicitly handled, use pandas' flexible parser
            return pd.to_datetime(date_str)
            
    except Exception as e:
        print(f"Error parsing date '{date_str}': {e}")
        return pd.NaT  # In case of error
    

def clean_combined_data(df):

    # Standardize date and extract year and month
    df['date'] = df['date'].astype(str).apply(parse_mixed_dates)
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month

    # Map EPG to ture names:
    epg_mapping = {
        "Group of the European People's Party (Christian Democrats)": 'EPP',
        "Group of the European People's Party (Christian Democrats) and European Democrats": 'EPP',
        'Socialist Group in the European Parliament': 'S&D',
        'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament': 'S&D',
        'Confederal Group of the European United Left - Nordic Green Left': 'The Left',
        'Group of the Greens/European Free Alliance': 'Greens/EFA',
        'Independence/Democracy Group': 'IDG',
        'Europe of freedom and democracy Group': 'IDG',
        'Europe of Freedom and Direct Democracy Group': 'IDG',
        'Europe of Nations and Freedom Group': 'ID',
        'European Conservatives and Reformists Group': 'ECR',
        'Non-attached Members': 'NI',
        'Group of the Alliance of Liberals and Democrats for Europe' : 'REG'
    }
    df['epg'] = df['epg'].replace(epg_mapping)


    # Merge PA 
    pa_mapping = {
        'regioanal development': 'regional development',
        'economic monetary affairs': 'economics',
        'juridical affairs': 'legal affairs'
    }
    df['policy_area'] = df['policy_area'].replace(pa_mapping)

    # Data type transformations
    df['epg'] = df['epg'].astype(str)
    df['year'] = df['year'].astype(float).astype(int)   
    df['vote'] = df['vote'].astype(float).astype(int)   

    # Remove incomplete years
    df = df[df['year'] != 2004]
    df = df[df['year'] != 2022]
    
    return df

In [3]:

voted_docs_files = ["VoteWatch-EP-voting-data_2004-2022/EP6_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP7_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP8_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP9_Voted docs.xlsx"]
rcv_files = ["VoteWatch-EP-voting-data_2004-2022/EP6_RCVs_2022_06_13.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP7_RCVs_2014_06_19.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP8_RCVs_2019_06_25.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP9_RCVs_2022_06_22.xlsx"]

df = process_ep_voting_data(rcv_files, voted_docs_files)

print('Saving uncleaned data')

output_file = "ep_voting_data_combined_raw.csv"
df.to_csv(output_file, index=False)

print('Cleaning data')
df = clean_combined_data(df)

print('Saving cleaned data')

output_file = "ep_voting_data_combined_clean.csv"
df.to_csv(output_file, index=False)


Processing files 1/4: VoteWatch-EP-voting-data_2004-2022/EP6_RCVs_2022_06_13.xlsx and VoteWatch-EP-voting-data_2004-2022/EP6_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 5827060
Got: 4759840
Processing files 2/4: VoteWatch-EP-voting-data_2004-2022/EP7_RCVs_2014_06_19.xlsx and VoteWatch-EP-voting-data_2004-2022/EP7_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 5937733
Got: 5233859
Processing files 3/4: VoteWatch-EP-voting-data_2004-2022/EP8_RCVs_2019_06_25.xlsx and VoteWatch-EP-voting-data_2004-2022/EP8_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 8796216
Got: 7696506
Processing files 4/4: VoteWatch-EP-voting-data_2004-2022/EP9_RCVs_2022_06_22.xlsx and VoteWatch-EP-voting-data_2004-2022/EP9_Voted docs.xlsx


  warn(msg)


Were not able to match: 0 votes
Should be around: 10915249
Got: 9520348
Saving uncleaned data
Cleaning data


In [268]:
import pandas as pd

try:
    headers_list = df.columns.tolist()
except NameError:
    df = pd.read_csv('ep_voting_data_combined_clean.csv', dtype=str)
    headers_list = df.columns.tolist()

print("Column headers:")
print(headers_list)


Column headers:
['full name', 'country', 'national_party', 'epg', 'mep_id', 'vote_code', 'vote', 'date', 'title', 'policy_area', 'ep_session', 'year', 'month', 'policy_area_cleaned']


# Background Design 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# Assumes df_votes and policy_year are defined in this environment

# 1) Define focus areas & pick sample
focus_areas = [
    'budgetary control', 'agriculture', 'culture education', 'development',
    'employment social affairs', 'environment public health', 'fisheries',
    'gender equality', 'international trade', 'legal affairs', 'regional development'
]
area_col = 'policy_area_cleaned' if 'policy_area_cleaned' in df_votes.columns else 'policy_area'
areas = [a for a in focus_areas if a in policy_year.columns]

# 2) Sample votes for those areas
sample_n = 300_000
df_sample = (
    df_votes[df_votes[area_col].isin(areas)]
    .sample(min(sample_n, len(df_votes)))
    .rename(columns={area_col: 'policy_area'})[['year','policy_area']]
    .reset_index(drop=True)
)

# 3) Create lookup table for rice values
py = policy_year.reset_index().melt(
    id_vars='year', var_name='policy_area', value_name='rice'
)

# 4) Merge rice into samples
df_merged = df_sample.merge(py, on=['year','policy_area'], how='left').dropna(subset=['rice'])

# 5) Prepare colors
from bokeh.palettes import Category20
palette = Category20[len(areas)]
color_map = {area: palette[i] for i, area in enumerate(areas)}
cols = [color_map[a] for a in df_merged['policy_area']]

# 6) Generate jittered coords with wider spreads
xs = []
ys = []
for year, rice in tqdm(zip(df_merged['year'], df_merged['rice']), total=len(df_merged)):
    xs.append(np.random.uniform(year - 0.5, year + 0.5))
    ys.append(rice + np.random.normal(scale=0.04))

# 7) Plot
plt.figure(figsize=(12, 8))
plt.scatter(xs, ys, s=0.6, c=cols, alpha=0.4, linewidths=0)
plt.axis('off')
plt.tight_layout()
plt.savefig('cover_beeswarm_spread.png', dpi=300, bbox_inches='tight')
plt.close()

print("Saved cover_beeswarm_spread.png")

# European Parliament Groups (EPGs)

In [231]:
from bokeh.models import ColumnDataSource, HTMLTemplateFormatter, TableColumn
from bokeh.models.widgets import DataTable
from bokeh.io import output_file, save
import pandas as pd

# --- EPG Info ---
epg_info = {
    'NI': {
        'name': 'Non-Attached Members', 
        'color': '#808080',
        'ideology': 'Mixed/Unaligned', 
        'abbr': 'NI'
    },
    'The Left': {
        'name': 'The Left', 
        'color': '#B71C1C',
        'ideology': 'Left-wing to Far-left', 
        'abbr': 'The Left'
    },
    'S&D': {
        'name': 'Progressive Alliance of Socialists and Democrats', 
        'color': '#D32F2F',
        'ideology': 'Center-left to Left-wing', 
        'abbr': 'S&D'
    },
    'IDG': {
        'name': 'Identity and Democracy Group', 
        'color': '#1A237E',
        'ideology': 'Right-wing to Far-right', 
        'abbr': 'IDG'
    },
    'Greens/EFA': {
        'name': 'Greens/European Free Alliance', 
        'color': '#43A047',
        'ideology': 'Green politics, Regionalist', 
        'abbr': 'Greens/EFA'
    },
    'REG': {
        'name': 'Renew Europe Group', 
        'color': '#039BE5',
        'ideology': 'Centrist, Liberal', 
        'abbr': 'REG'
    },
    'EPP': {
        'name': 'European People\'s Party', 
        'color': '#0D47A1',
        'ideology': 'Center-right to Right-wing', 
        'abbr': 'EPP'
    }
}

# --- Filter and count MEPs ---
df_2015 = df[(df['year'] == 2015) & (df['epg'].isin(epg_info.keys()))]
unique_meps_2015 = df_2015[['mep_id', 'epg']].drop_duplicates()
total_meps = unique_meps_2015['mep_id'].nunique()
mep_counts = unique_meps_2015.groupby('epg')['mep_id'].nunique().to_dict()

# --- Build table data ---
rows = []
for epg, info in epg_info.items():
    count = mep_counts.get(epg, 0)
    percent = (count / total_meps * 100) if total_meps > 0 else 0
    rows.append({
        'Color': info['color'],
        'ColorBox': f"<div style='width:20px; height:20px; background-color:{info['color']}; border:1px solid #000;'></div>",
        'Abbreviation': info['abbr'],
        'Full Name': f"<span style='color:{info['color']}; font-weight:bold;'>{info['name']}</span>",
        'Ideology': info['ideology'],
        'MEPs (2015)': f"{percent:.2f}%"
    })

df_table = pd.DataFrame(rows).sort_values('Color')
source = ColumnDataSource(df_table)

# --- Define Columns with HTML formatting ---
columns = [
    TableColumn(field='ColorBox', title='', formatter=HTMLTemplateFormatter(template='<%= value %>')),
    TableColumn(field='Abbreviation', title='Abbr.'),
    TableColumn(field='Full Name', title='Full Name', formatter=HTMLTemplateFormatter(template='<%= value %>')),
    TableColumn(field='Ideology', title='Ideology'),
    TableColumn(field='MEPs (2015)', title='MEPs (2015)')
]

# --- Create Bokeh DataTable ---
data_table = DataTable(source=source, columns=columns, width=1000, height=280, index_position=None)

# --- Output HTML ---
output_file("epg_table_bokeh.html")
save(data_table)

print("✅ Saved: plots/epg_table_bokeh.html")


✅ Saved: epg_table_bokeh.html


# Political polarization within EPGs

In [None]:
def rice_index(yes, no, abstain):
    return abs(yes - no) / (yes + no + abstain)

In [None]:
# keep only real votes
df_votes = df[df['vote_code'].notna() & df['vote'].isin([1,2,3])].copy()

# count yes/no/abstain per roll-call (vote_code) & year
counts = (
    df_votes
    .groupby(['vote_code','year'])['vote']
    .value_counts()
    .unstack(fill_value=0)
    .rename(columns={1:'yes', 2:'no', 3:'abstain'})
    .reset_index()
)

# compute Rice for each vote_code
counts['rice'] = counts.apply(lambda r: rice_index(r.yes, r.no, r.abstain), axis=1)

# average across all vote_codes in each year
yearly = (
    counts
    .groupby('year')['rice']
    .mean()
    .reset_index()
    .sort_values('year')
)

#    Count yes/no/abstain per (party, vote_code, year)
pv = (
    df_votes
    .groupby(['epg', 'vote_code', 'year'])['vote']
    .value_counts()
    .unstack(fill_value=0)
    .rename(columns={1: 'yes', 2: 'no', 3: 'abstain'})
    .reset_index()
)
pv['rice'] = pv.apply(lambda r: rice_index(r.yes, r.no, r.abstain), axis=1)

# Average Rice per party & year, pivot so each party is a column
party_year = (
    pv
    .groupby(['epg', 'year'])['rice']
    .mean()
    .reset_index()
    .pivot(index='year', columns='epg', values='rice')
    .sort_index()
)

# Agreement by policy area by year
pp = (
    df_votes
    .groupby(['policy_area', 'vote_code', 'year'])['vote']
    .value_counts()
    .unstack(fill_value=0)
    .rename(columns={1: 'yes', 2: 'no', 3: 'abstain'})
    .reset_index()
)
pp['rice'] = pp.apply(lambda r: rice_index(r.yes, r.no, r.abstain), axis=1)

policy_year = (
    pp
    .groupby(['policy_area', 'year'])['rice']
    .mean()
    .reset_index()
    .pivot(index='year', columns='policy_area', values='rice')
    .sort_index()
)

#  Count yes/no/abstain per (party, policy_area)
pp = (
    df_votes
    .groupby(['epg','policy_area'])['vote']
    .value_counts()
    .unstack(fill_value=0)
    .rename(columns={1:'yes',2:'no',3:'abstain'})
    .reset_index()
)


#  Pivot so each party is a column, rows are policy_area
party_policy = (
    pp
    .pivot(index='policy_area', columns='epg', values='rice')
    .fillna(0)               # in case some party never voted in an area
    .sort_index()
)



# Parliament-wide Agreement


In [None]:
import pandas as pd
import numpy as np
from bokeh.io import output_file, save
from bokeh.plotting import figure
from bokeh.models import (
    ColumnDataSource, HoverTool,
    CheckboxGroup, Select, CustomJS
)
from bokeh.layouts import column, row
from bokeh.palettes import Category20

# ──────────────────────────────────────────────────────────────────────────────
# COMMON FIGURE KWARGS
# ──────────────────────────────────────────────────────────────────────────────
FIG_KWARGS = dict(
    width=700, height=500,
    background_fill_color="white",
    border_fill_color="white",
    y_range=(0.4,0.7),
    tools="pan,wheel_zoom,box_zoom,reset,save,hover",
)

# ──────────────────────────────────────────────────────────────────────────────
# 1️⃣ Parliament‐wide static line
# ──────────────────────────────────────────────────────────────────────────────
p1 = figure(
    x_axis_label="Year", y_axis_label="Rice index",
    **FIG_KWARGS
)
# line
p1.line(yearly['year'], yearly['rice'], line_width=2)
# style text black
p1.title.text_color               = 'black'
p1.xaxis.axis_label_text_color    = 'black'
p1.yaxis.axis_label_text_color    = 'black'
p1.xaxis.major_label_text_color   = 'black'
p1.yaxis.major_label_text_color   = 'black'
# hover (shows year and rice)
hover1 = p1.select_one(HoverTool)
hover1.tooltips = [("Year","@x"),("Rice","@y{0.000}")]
# save
output_file("01_parliament_agreement.html", title="Parliament-wide Agreement")
save(p1)

# Polarization by Policy Area

In [None]:
import numpy as np
from bokeh.io import output_file, save
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.layouts import row
from bokeh.palettes import Category20

# ─── 0) Focus areas and data prep ──────────────────────────────────────────
focus_areas = [
    'budgetary control', 'agriculture', 'culture education', 'development',
    'employment social affairs', 'environment public health', 'fisheries',
    'gender equality', 'international trade', 'legal affairs', 'regional development'
]
areas = [a for a in focus_areas if a in policy_year.columns]

# build the avg source
avg_src = ColumnDataSource(dict(
    year   = yearly['year'].tolist(),
    rice   = yearly['rice'].tolist(),
    policy = ['Average'] * len(yearly),
))

# build one CDS per policy area
sources = {}
for a in areas:
    sources[a] = ColumnDataSource(dict(
        year   = policy_year.index.tolist(),
        rice   = policy_year[a].tolist(),
        policy = [a]*len(policy_year)
    ))

# color map
palette = Category20[len(areas)]
colors = {a: palette[i] for i,a in enumerate(areas)}

# ─── 1) Figure setup ───────────────────────────────────────────────────────
p = figure(
    width=750, height=600,
    x_axis_label="Year", y_axis_label="Rice index",
    tools="pan,wheel_zoom,box_zoom,reset,save,hover",
    background_fill_color="white", border_fill_color="white",
    y_range=(0,1),
)
# force black text everywhere
p.title.text_color                = 'black'
p.xaxis.axis_label_text_color     = 'black'
p.yaxis.axis_label_text_color     = 'black'
p.xaxis.major_label_text_color    = 'black'
p.yaxis.major_label_text_color    = 'black'

# hover
hover = p.select_one(HoverTool)
hover.tooltips = [
    ("Policy", "@policy"),
    ("Year",   "@year"),
    ("Rice",   "@rice{0.000}")
]

# ─── 2) Plot the average line ───────────────────────────────────────────────
p.line('year','rice', source=avg_src,
       line_width=3, color='black', alpha=0.8, legend_label="Average")

# ─── 3) Plot each focus‐area line ───────────────────────────────────────────
for area in areas:
    p.line('year','rice', source=sources[area],
           line_width=2, color=colors[area],
           alpha=0.8, legend_label=area)

# Move legend to bottom center, make it horizontal, use 4 columns, click to hide
p.legend.location         = "bottom_center"
p.legend.orientation      = "horizontal"
p.legend.click_policy     = "hide"
p.legend.label_text_color = "black"
p.legend.title_text_color = "black"
p.legend.ncols          = 4
# light grey background + subtle border
p.legend.background_fill_color = "#f0f0f0"   # pale grey
p.legend.background_fill_alpha = 1.0

# Then layout & save as before
output_file("agreement_by_selected_policy_areas.html",
            title="Agreement by Selected Policy Areas")
save(p)

# Within-Party Cohesion

In [None]:
import numpy as np
from bokeh.io import output_file, save
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool, Legend, LegendItem
from bokeh.layouts import column

# 0) Your EPG info mapping (7 parties)
epg_info = {
    'NI':        {'color':'#808080','abbr':'NI'},
    'The Left':  {'color':'#B71C1C','abbr':'The Left'},
    'S&D':       {'color':'#D32F2F','abbr':'S&D'},
    'IDG':       {'color':'#1A237E','abbr':'IDG'},
    'Greens/EFA':{'color':'#43A047','abbr':'Greens/EFA'},
    'REG':       {'color':'#039BE5','abbr':'REG'},
    'EPP':       {'color':'#0D47A1','abbr':'EPP'},
}

# 1) Restrict to complete parties 2005–2021
years = list(range(2005,2022))
valid_parties = [
    p for p in party_year.columns
    if p in epg_info and party_year.loc[years, p].notna().all()
]
# (should be all 7 now)
party_year = party_year[valid_parties]

# 2) Build data source & figure
src = ColumnDataSource({
    'year': party_year.index.tolist(),
    **{p: party_year[p].tolist() for p in valid_parties}
})

p = figure(
    width=700, height=500,
    x_axis_label="Year", y_axis_label="Rice index",
    tools="pan,wheel_zoom,box_zoom,reset,save,hover",
    background_fill_color="white", border_fill_color="white",
    y_range=(0,1)
)

# Force all text to black
for axis in (p.xaxis, p.yaxis):
    axis.axis_label_text_color = 'black'
    axis.major_label_text_color = 'black'
p.title.text_color = 'black'

# Hover on lines
hover = p.select_one(HoverTool)
hover.tooltips = [
    ("Party", "$name"),
    ("Year",  "@year"),
    ("Rice",  "@$name{0.000}")
]

# 3) Draw **all** lines (alpha=0.8 so all visible)
lines = {}
for party in valid_parties:
    lines[party] = p.line(
        'year', party, source=src,
        line_width=2, color=epg_info[party]['color'],
        line_alpha=0.8, name=party
    )

# 4) Build **all** LegendItems
legend_items = [
    LegendItem(label=epg_info[p]['abbr'], renderers=[lines[p]])
    for p in valid_parties
]
legend = Legend(
    items=legend_items,
    location="bottom_left",
    click_policy="hide",
    label_text_color="black",
    background_fill_color="#f0f0f0",
    border_line_color="#d3d3d3",
    border_line_width=1
)
p.add_layout(legend)

# 5) Save
output_file("02_within_party_agreement.html", title="Within-Party Agreement")
save(p)


# Party Disagreement by Policy Area


In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, Select, CustomJS
from bokeh.layouts import column

output_notebook()



allowed_parties = list(epg_info.keys())

# ─── 1) Prepare the CDS with filtered policies & parties ────────────────────
# Filter policy areas
policies = [area for area in party_policy.index if area in focus_areas]

# Filter parties by epg_info
parties = [p for p in party_policy.columns if p in allowed_parties]

# Build data dict
data = {'policy_area': policies}
for p in parties:
    # take rice scores only for our focus areas
    data[p] = [party_policy.loc[area, p] for area in policies]

# initialize “y” to the first allowed party
initial = parties[0]
data['y'] = data[initial]

source = ColumnDataSource(data)

# ─── 2) Make the figure (categorical x-axis) ───────────────────────────────
p6 = figure(x_range=policies,
           x_axis_label='Policy Area',
           y_axis_label='Rice index',
           width=700, height=700,
           background_fill_color="white",
           border_fill_color="white",
           y_range=(0,1),
           tools="pan,wheel_zoom,box_zoom,reset,save,hover",
          )
p6.vbar(x='policy_area', top='y', source=source, width=0.8)
p6.xaxis.major_label_orientation = 1.2

# ─── 3) Force all text to black ─────────────────────────────────────────────
p6.title.text_color                   = 'black'
p6.xaxis.axis_label_text_color        = 'black'
p6.yaxis.axis_label_text_color        = 'black'
p6.xaxis.major_label_text_color       = 'black'
p6.yaxis.major_label_text_color       = 'black'


# ─── 3) Dropdown widget ────────────────────────────────────────────────────
dropdown = Select(title="Party:", value=initial, options=parties, width=200)

# ─── 4) JS callback to swap in the chosen party’s column ────────────────────
callback = CustomJS(args=dict(src=source), code="""
    const d = src.data;
    const choice = cb_obj.value;
    d['y'] = d[choice];
    src.change.emit();
""")
dropdown.js_on_change('value', callback)

# ─── 5) Layout & show ──────────────────────────────────────────────────────
show(column(dropdown, p6))

# Then layout & save as before
output_file("04_bar_rice_by_policy_and_party.html",
            title="Agreement by Selected Policy Areas")
save((dropdown, p6))


## Create simmilarity matrix between EPGs for each year and policy area

In [199]:
from itertools import combinations

def agreement_index(votes1, votes2):

    total_votes_1 = sum(votes1)
    total_votes_2 = sum(votes2)
    
    total_difference = 0
    
    for i in range(3): #Consider 3 vote types, 1 (yes), 2 (no), 3 (abstain)

        pct1 = votes1[i] / total_votes_1
        pct2 = votes2[i] / total_votes_2
        
        # Add minimum percentage as the agreement for this vote type
        total_difference += abs(pct1 - pct2)
    
    return 1 - (total_difference / 2)

def agreement_index(votes1, votes2):
    """
    Rice index
    """

    total_votes = sum(votes1) + sum(votes2)
    
    total_yes = votes1[0] + votes2[0]
    total_no = votes1[1] + votes2[1]
    
    
    return abs(total_yes - total_no) / total_votes



In [200]:

# Filter out EPG that appear in all years
years = df['year'].unique()
epgs_by_year = [set(df[df['year'] == year]['epg'].dropna().unique()) for year in years]
common_epgs = set.intersection(*epgs_by_year)


# Modified code to use the new similarity function
df_year = df.groupby(['year', 'policy_area'])
epgs = sorted(list(common_epgs))
similarity_matrices = {}


# Track missing data
missing_data_count = 0
total_combinations = 0

for name, group in df_year:
    year, policy_area = name 
    year = int(year)
    sim_matrix = pd.DataFrame(index=epgs, columns=epgs)
    
    # Track missing data for this year/policy_area
    missing_in_current = 0
    total_in_current = 0
    
    # Calculate similarities between all EPG pairs
    for epg1, epg2 in combinations(epgs, 2):
        total_combinations += 1
        total_in_current += 1
        
        ep1_votes_series = group[group['epg'] == epg1]['vote'].value_counts()
        ep2_votes_series = group[group['epg'] == epg2]['vote'].value_counts()

        # Initialize arrays with zeros for vote types 1, 2, and 3
        ep1_votes = [0, 0, 0]  # Index 0 for vote value 1, index 1 for vote value 2, etc.
        ep2_votes = [0, 0, 0]

        # Fill in the counts from the Series, handling both int and float vote types
        for vote_type, count in ep1_votes_series.items():
            vote_index = int(float(vote_type)) - 1  # Convert vote type (1,2,3) to index (0,1,2)
            if 0 <= vote_index <= 2:  # Only include vote types 1, 2, and 3
                ep1_votes[vote_index] = count

        for vote_type, count in ep2_votes_series.items():
            vote_index = int(float(vote_type)) - 1
            if 0 <= vote_index <= 2:
                ep2_votes[vote_index] = count

        
        # Check if any relevant votes exist for both EPGs
        has_votes_1 = sum(ep1_votes) != 0
        has_votes_2 = sum(ep2_votes) != 0
        
        if not has_votes_1 or not has_votes_2:
            missing_data_count += 1
            missing_in_current += 1
            similarity = 0  # No relevant votes, similarity is 0
        else:
            similarity = agreement_index(ep1_votes, ep2_votes)
            
        sim_matrix.loc[epg1, epg2] = similarity
        sim_matrix.loc[epg2, epg1] = similarity  # Matrix is symmetric
    
    # Set diagonal to 1 (self-similarity)
    for epg in epgs:
        sim_matrix.loc[epg, epg] = 1.0
            
    # Store the matrix
    if policy_area not in similarity_matrices:
        similarity_matrices[policy_area] = {}
    similarity_matrices[policy_area][year] = sim_matrix
    
    # Print summary for this year/policy_area
    if missing_in_current > 0:
        print(f"Year {year}, Policy Area '{policy_area}': {missing_in_current}/{total_in_current} EPG pairs have missing vote data ({missing_in_current/total_in_current:.1%})")

# Print overall summary
print(f"\nOverall: {missing_data_count}/{total_combinations} EPG pairs have missing vote data ({missing_data_count/total_combinations:.1%})")


Overall: 0/7791 EPG pairs have missing vote data (0.0%)


# Visualizing Voting Alignment between EPGs

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
from bokeh.plotting import figure, save, output_file
from bokeh.models import ColumnDataSource, LabelSet, Slider, CustomJS, Button, Range1d, Legend, LegendItem
from bokeh.layouts import column, row

# Set output to an HTML file
output_file("plots/epg_clustering_all_policies_merged.html")
 

# Merge data from all policy areas for each year
merged_similarity_matrices = {}

# Get all unique years across all policy areas
all_years = set()
for policy_area in similarity_matrices:
    all_years.update(similarity_matrices[policy_area].keys())
all_years = sorted(all_years)

# Get all unique EPGs
all_epgs = set()
for policy_area in similarity_matrices:
    for year in similarity_matrices[policy_area]:
        all_epgs.update(similarity_matrices[policy_area][year].index)
all_epgs = sorted(list(all_epgs))

# Merge data for each year by averaging similarity scores across policy areas
for year in all_years:
    # Create a DataFrame with zeros for all EPG pairs - explicitly use float dtype
    merged_matrix = pd.DataFrame(0.0, index=all_epgs, columns=all_epgs, dtype=float)
    count_matrix = pd.DataFrame(0, index=all_epgs, columns=all_epgs, dtype=int)
    
    # Add similarity scores from each policy area
    for policy_area in similarity_matrices:
        if year in similarity_matrices[policy_area]:
            policy_matrix = similarity_matrices[policy_area][year]
            for epg1 in policy_matrix.index:
                for epg2 in policy_matrix.columns:
                    if epg1 in merged_matrix.index and epg2 in merged_matrix.columns:
                        # Explicitly convert to float to avoid dtype issues
                        value = float(policy_matrix.loc[epg1, epg2])
                        merged_matrix.loc[epg1, epg2] += value
                        count_matrix.loc[epg1, epg2] += 1
    
    # Average the similarity scores
    for epg1 in merged_matrix.index:
        for epg2 in merged_matrix.columns:
            if count_matrix.loc[epg1, epg2] > 0:
                merged_matrix.loc[epg1, epg2] /= float(count_matrix.loc[epg1, epg2])
            else:
                # No data for this pair, set to 0 if different EPGs, 1 if same EPG
                merged_matrix.loc[epg1, epg2] = 1.0 if epg1 == epg2 else 0.0
    
    # Store the merged matrix
    merged_similarity_matrices[year] = merged_matrix

# Get years with enough data
years = [year for year in all_years if len(merged_similarity_matrices[year]) >= 3]

# Make sure we have at least one valid year
if not years:
    raise ValueError("No years have enough EPGs for visualization")

# Get EPGs from the first year
epgs = list(merged_similarity_matrices[years[0]].index)

# Function to get color for an EPG (using the predefined colors or default to gray)
def get_epg_color(epg):
    if epg in epg_info:
        return epg_info[epg]['color']
    return '#CCCCCC'  # Default gray for unknown EPGs

# Function to get full name for an EPG
def get_epg_full_name(epg):
    if epg in epg_info:
        return epg_info[epg]['name']
    return epg  # Default to the abbreviation if not found

# Function to get ideology for an EPG
def get_epg_ideology(epg):
    if epg in epg_info:
        return epg_info[epg]['ideology']
    return 'Unknown'  # Default to Unknown if not found

# Function to perform dimensionality reduction on a similarity matrix
def get_coordinates(similarity_matrix, method='pca'):
    # Convert similarity to distance matrix
    distance_matrix = 1 - similarity_matrix
    
    # Convert to numpy array
    X = distance_matrix.values.astype(float)  # Ensure float data type
    
    # Apply dimensionality reduction
    if method == 'pca':
        model = PCA(n_components=2)
        result = model.fit_transform(X)
    else:
        raise ValueError(f"Unknown method: {method}")
    
    # Create DataFrame with results - use explicit dtypes
    df_result = pd.DataFrame({
        'x': result[:, 0].astype(float),
        'y': result[:, 1].astype(float),
        'epg': distance_matrix.index.tolist(),
        'color': [get_epg_color(epg) for epg in distance_matrix.index],
        'full_name': [get_epg_full_name(epg) for epg in distance_matrix.index],
        'ideology': [get_epg_ideology(epg) for epg in distance_matrix.index]
    })
    
    return df_result

# Function to align coordinates with reference using Procrustes analysis
def align_coordinates(target_df, reference_df):
    # Get common EPGs
    common_epgs = set(target_df['epg']).intersection(set(reference_df['epg']))
    
    if len(common_epgs) < 2:
        # Not enough common points to align
        return target_df
    
    # Extract coordinates for common EPGs
    target_coords = np.array([target_df[target_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
    reference_coords = np.array([reference_df[reference_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
    
    # Perform Procrustes analysis to align target to reference
    mtx1, mtx2, disparity = procrustes(reference_coords, target_coords)
    
    # Create transformation matrix (scale, rotation, reflection)
    scale = np.sqrt(np.sum(mtx1[0]**2)) / np.sqrt(np.sum(target_coords[0]**2))
    
    # Apply transformation to all points in target_df
    result_df = target_df.copy()
    coords = result_df[['x', 'y']].values.astype(float)  # Ensure float data type
    
    # Scale and center (simplified Procrustes)
    coords_scaled = coords * scale
    
    # Get centroids
    target_centroid = np.mean(target_coords, axis=0)
    reference_centroid = np.mean(reference_coords, axis=0)
    
    # Translate
    coords_transformed = coords_scaled - target_centroid + reference_centroid
    
    # Update dataframe
    result_df['x'] = coords_transformed[:, 0]
    result_df['y'] = coords_transformed[:, 1]
    
    return result_df

# Compute coordinates for the first year (reference)
method = 'pca'  # PCA is more stable for small numbers of points
reference_data = get_coordinates(merged_similarity_matrices[years[0]], method=method)

# Compute and align coordinates for all years
aligned_data = {}
for year in years:
    try:
        # Compute initial coordinates
        year_data = get_coordinates(merged_similarity_matrices[year], method=method)
        
        # Align with reference
        if year == years[0]:
            aligned_data[year] = year_data  # Reference year doesn't need alignment
        else:
            aligned_data[year] = align_coordinates(year_data, reference_data)
    except Exception as e:
        print(f"Error processing year {year}: {e}")
        # Skip this year

# Make sure we have at least one valid year after processing
if not aligned_data:
    raise ValueError("No valid data after processing")

# Find the overall range of data across all years to set consistent plot boundaries
all_x = []
all_y = []
for year_data in aligned_data.values():
    all_x.extend(year_data['x'].tolist())  # Convert to list
    all_y.extend(year_data['y'].tolist())  # Convert to list

x_min, x_max = min(all_x), max(all_x)
y_min, y_max = min(all_y), max(all_y)

# Add padding (200% zoom - 50% padding on each side)
padding_x = (x_max - x_min) * 0.5
padding_y = (y_max - y_min) * 0.5
x_range = (float(x_min - padding_x), float(x_max + padding_x))  # Ensure float type
y_range = (float(y_min - padding_y), float(y_max + padding_y))  # Ensure float type

# Create initial plot with first year
current_year = years[0]
init_data = aligned_data[current_year]

# Create ColumnDataSource
source = ColumnDataSource(init_data)

# Create the figure with fixed range
p = figure(width=800, height=600, 
           title=f'EPG Clustering - All Policy Areas Combined ({current_year})',
           tools="pan,wheel_zoom,box_zoom,reset,save,hover",
           x_range=Range1d(x_range[0], x_range[1]),
           y_range=Range1d(y_range[0], y_range[1]),
           tooltips=[
               ("Group", "@full_name (@epg)"),
               ("Ideology", "@ideology")
           ])

# Add scatter points
circles = p.circle('x', 'y', source=source, size=15, color='color', alpha=0.8, 
                 line_color='black', line_width=1)

# Add labels
#labels = LabelSet(x='x', y='y', text='epg', source=source,
#                 text_font_size='10pt', text_color='black',
#                 x_offset=8, y_offset=-8)  # Changed to y_offset=-5 to position labels below points
#p.add_layout(labels)

# Create a legend to explain the colors
legend_items = []
for epg, info in epg_info.items():
    if epg in init_data['epg'].values:
        legend_items.append((f"{epg} - {info['name']}", [p.circle(x=0, y=0, color=info['color'], size=10, visible=False)]))

legend = Legend(items=legend_items, location="top_left")
legend.click_policy = "hide"  # Make the legend interactive
p.add_layout(legend)

# Prepare data for JavaScript
js_data = {}
for year in aligned_data.keys():
    # Convert DataFrame to dictionary safely
    year_dict = {}
    for col in aligned_data[year].columns:
        year_dict[col] = aligned_data[year][col].tolist()  # Convert all values to lists
    js_data[str(year)] = year_dict

# Get sorted years that have data
valid_years = sorted(aligned_data.keys())

# Create a slider for years
year_slider = Slider(title="Year", start=0, end=len(valid_years)-1, value=0, step=1, width=600)

# Create play/pause button
play_button = Button(label="▶️ Play", button_type="success", width=100)

# JavaScript callback for slider
slider_callback = CustomJS(args=dict(source=source, p=p, years=valid_years, data=js_data), code="""
    // Get the selected year index
    const yearIndex = cb_obj.value;
    const year = years[yearIndex];
    
    // Update data from precomputed results
    const year_data = data[year];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - All Policy Areas Combined (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# Connect callback to slider
year_slider.js_on_change('value', slider_callback)

# Animation callback for play button
animation_callback = CustomJS(args=dict(slider=year_slider, button=play_button), code="""
    if (button.label === "▶️ Play") {
        // Start animation
        button.label = "⏸️ Pause";
        
        // Function to increment slider
        function animate_slider() {
            if (button.label === "⏸️ Pause") {
                let current = slider.value;
                let next = current + 1;
                
                // Loop back to beginning if at the end
                if (next > slider.end) {
                    next = slider.start;
                }
                
                // Update slider value (this will trigger the slider callback)
                slider.value = next;
                
                // Schedule next update
                window.setTimeout(animate_slider, 1000);  // 1 second interval
            }
        }
        
        // Start animation
        animate_slider();
    } else {
        // Pause animation
        button.label = "▶️ Play";
    }
""")

# Connect animation callback to play button
play_button.js_on_click(animation_callback)

# Create layout
layout = column(
    row(year_slider, play_button),
    p
)

# Save the visualization to an HTML file
save(layout)

print("Visualization saved to 'epg_clustering_all_policies_merged.html'. Open this file in a web browser to view the animation.")

# Analyse sim matrix

In [201]:
import pandas as pd
import numpy as np
from itertools import combinations

# Function to find most and least related EPG pairs from a similarity matrix
def get_most_least_related(similarity_matrix):
    # Create a copy to avoid modifying the original
    sim_mat = similarity_matrix.copy()
    
    # Exclude NI group if it exists in the matrix
    if 'NI' in sim_mat.index:
        sim_mat = sim_mat.drop('NI', axis=0)
    if 'NI' in sim_mat.columns:
        sim_mat = sim_mat.drop('NI', axis=1)
    
    # If matrix is empty after dropping NI, return empty results
    if sim_mat.empty:
        return {
            'most_related': {'pairs': [], 'similarity': np.nan},
            'least_related': {'pairs': [], 'similarity': np.nan}
        }
    
    # Set diagonal to NaN to exclude self-relationships
    np.fill_diagonal(sim_mat.values, np.nan)
    
    # Find max similarity (most related)
    max_val = sim_mat.max().max()
    max_idx = np.where(sim_mat == max_val)
    max_pairs = list(zip(sim_mat.index[max_idx[0]], sim_mat.columns[max_idx[1]]))
    
    # Find min similarity (least related)
    min_val = sim_mat.min().min()
    min_idx = np.where(sim_mat == min_val)
    min_pairs = list(zip(sim_mat.index[min_idx[0]], sim_mat.columns[min_idx[1]]))
    
    # Remove duplicates due to symmetry
    max_pairs = [tuple(sorted(pair)) for pair in max_pairs]
    min_pairs = [tuple(sorted(pair)) for pair in min_pairs]
    max_pairs = list(dict.fromkeys(max_pairs))
    min_pairs = list(dict.fromkeys(min_pairs))
    
    return {
        'most_related': {'pairs': max_pairs, 'similarity': max_val},
        'least_related': {'pairs': min_pairs, 'similarity': min_val}
    }

# Create a dataframe to store results
policy_results = []

# Dictionaries to store yearly aggregated matrices across policy areas
yearly_matrices = {}
all_years = set()

# Analyze each policy area
for policy_area, years_data in similarity_matrices.items():
    # First analyze each year individually
    for year, sim_matrix in sorted(years_data.items()):
        # Convert year to int if it's not already (to ensure consistent handling)
        year_val = int(year) if not isinstance(year, int) else year
        
        # Track all years for later use
        all_years.add(year_val)
        
        # Add to yearly aggregated matrices
        if year_val not in yearly_matrices:
            yearly_matrices[year_val] = sim_matrix.copy()
        else:
            yearly_matrices[year_val] += sim_matrix
        
        relations = get_most_least_related(sim_matrix)
        
        # Skip if no valid relations (e.g., only NI group was present)
        if np.isnan(relations['most_related']['similarity']):
            continue
        
        # Format the results
        most_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']])
        least_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']])
        
        policy_results.append({
            'Policy Area': policy_area,
            'Year': str(year_val),  # Convert to string for consistent handling
            'Most Related EPGs': most_related_pairs,
            'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
            'Least Related EPGs': least_related_pairs,
            'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
        })
    
    # Calculate the average similarity matrix across all years for this policy area
    avg_matrix = None
    years_count = 0
    
    for year, matrix in years_data.items():
        if avg_matrix is None:
            avg_matrix = matrix.copy()
        else:
            avg_matrix += matrix
        years_count += 1
    
    if avg_matrix is not None and years_count > 0:
        avg_matrix = avg_matrix / years_count
        
        # Get the most and least related from the average matrix
        relations = get_most_least_related(avg_matrix)
        
        # Skip if no valid relations
        if np.isnan(relations['most_related']['similarity']):
            continue
        
        # Format the results
        most_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']])
        least_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']])
        
        policy_results.append({
            'Policy Area': policy_area,
            'Year': 'TOTAL',
            'Most Related EPGs': most_related_pairs,
            'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
            'Least Related EPGs': least_related_pairs,
            'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
        })

# Calculate the average for each year across all policy areas
yearly_results = []
policy_areas_count = len(similarity_matrices)

for year in sorted(yearly_matrices.keys()):
    yearly_avg_matrix = yearly_matrices[year] / policy_areas_count
    relations = get_most_least_related(yearly_avg_matrix)
    
    # Skip if no valid relations
    if np.isnan(relations['most_related']['similarity']):
        continue
    
    # Format the results
    most_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']])
    least_related_pairs = ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']])
    
    yearly_results.append({
        'Year': str(year),  # Convert to string
        'Most Related EPGs': most_related_pairs,
        'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
        'Least Related EPGs': least_related_pairs,
        'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
    })

# Calculate overall total across all years and policy areas
overall_matrix = None
matrix_count = 0

for policy_area, years_data in similarity_matrices.items():
    for year, matrix in years_data.items():
        if overall_matrix is None:
            overall_matrix = matrix.copy()
        else:
            overall_matrix += matrix
        matrix_count += 1

if overall_matrix is not None and matrix_count > 0:
    overall_matrix = overall_matrix / matrix_count
    
    # Get the most and least related from the overall matrix
    relations = get_most_least_related(overall_matrix)
    
    # Add to yearly results as a total row
    if not np.isnan(relations['most_related']['similarity']):
        yearly_results.append({
            'Year': 'TOTAL',
            'Most Related EPGs': ', '.join([f"{p[0]}-{p[1]}" for p in relations['most_related']['pairs']]),
            'Most Related Similarity': f"{relations['most_related']['similarity']:.3f}",
            'Least Related EPGs': ', '.join([f"{p[0]}-{p[1]}" for p in relations['least_related']['pairs']]),
            'Least Related Similarity': f"{relations['least_related']['similarity']:.3f}"
        })

# Convert to dataframes
policy_df = pd.DataFrame(policy_results)
yearly_df = pd.DataFrame(yearly_results)

# Create a custom sorting order to ensure TOTAL is at the end
def custom_sort(df, col):
    # Extract the non-TOTAL values and sort them numerically
    non_total = [x for x in df[col].unique() if x != 'TOTAL']
    # Convert to ints for proper numerical sorting, then back to strings
    sorted_non_total = [str(x) for x in sorted([int(x) for x in non_total])]
    # Create the full sort order with TOTAL at the end
    sort_order = {val: i for i, val in enumerate(sorted_non_total + ['TOTAL'])}
    # Apply the sort
    return df.sort_values(by=col, key=lambda x: x.map(sort_order))

# Apply custom sorting
if not policy_df.empty:
    policy_df = policy_df.sort_values('Policy Area')
    policy_df = custom_sort(policy_df, 'Year')

if not yearly_df.empty:
    yearly_df = custom_sort(yearly_df, 'Year')

# Display the yearly aggregated results
print("Yearly EPG Relationship Analysis (All Policy Areas Combined)")
print("=" * 100)
print("Note: NI (Non-Attached Members) excluded from analysis")
print("-" * 100)
if not yearly_df.empty:
    print(yearly_df.to_string(index=False))
else:
    print("No data available after excluding NI group")
print("\n\n")

# Display the policy-specific results
print("EPG Relationship Analysis by Policy Area and Year")
print("=" * 100)
print("Note: NI (Non-Attached Members) excluded from analysis")
print("-" * 100)

if not policy_df.empty:
    for policy_area, group in policy_df.groupby('Policy Area'):
        print(f"\nPolicy Area: {policy_area}")
        print("-" * 100)
        
        # Apply custom sorting within each group to ensure TOTAL is at the end
        group = custom_sort(group, 'Year')
        
        # Format for display
        display_df = group[['Year', 'Most Related EPGs', 'Most Related Similarity', 
                            'Least Related EPGs', 'Least Related Similarity']]
        
        # Convert to string representation with aligned columns
        print(display_df.to_string(index=False))
else:
    print("No data available after excluding NI group")

Yearly EPG Relationship Analysis (All Policy Areas Combined)
Note: NI (Non-Attached Members) excluded from analysis
----------------------------------------------------------------------------------------------------
 Year   Most Related EPGs Most Related Similarity Least Related EPGs Least Related Similarity
 2005 Greens/EFA-The Left                   0.341       IDG-The Left                    0.196
 2006 Greens/EFA-The Left                   0.407       IDG-The Left                    0.167
 2007 Greens/EFA-The Left                   0.288       IDG-The Left                    0.161
 2008      Greens/EFA-REG                   0.448       IDG-The Left                    0.273
 2009      Greens/EFA-S&D                   0.554       IDG-The Left                    0.283
 2010      Greens/EFA-S&D                   0.463       IDG-The Left                    0.263
 2011             REG-S&D                   0.494       IDG-The Left                    0.177
 2012             REG-S&D      

# How Similar European Parliament Groups Voted Over Time Across Policy Areas


In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
from bokeh.plotting import figure, save, output_file
from bokeh.models import (ColumnDataSource, LabelSet, Slider, CustomJS, Button, 
                         Range1d, Select, Legend, LegendItem, HoverTool)
from bokeh.layouts import column, row

# Set output to an HTML file
output_file("plots/epg_clustering_interactive.html")

# Function to get color for an EPG (using the predefined colors or default to gray)
def get_epg_color(epg):
    if epg in epg_info:
        return epg_info[epg]['color']
    return '#CCCCCC'  # Default gray for unknown EPGs

# Function to get full name for an EPG
def get_epg_full_name(epg):
    if epg in epg_info:
        return epg_info[epg]['name']
    return epg  # Default to the abbreviation if not found

# Function to get ideology for an EPG
def get_epg_ideology(epg):
    if epg in epg_info:
        return epg_info[epg]['ideology']
    return 'Unknown'  # Default to Unknown if not found

# Filter to only include specific policy areas
filtered_policy_areas = [
    'budgetary control', 
    'agriculture', 
    'culture education', 
    'development', 
    'employment social affairs', 
    'environment public health', 
    'fisheries', 
    'gender equality', 
    'international trade', 
    'legal affairs', 
    'regional development'
]

# Get available policy areas that exist in the data
policy_areas = [area for area in filtered_policy_areas if area in similarity_matrices]

# Create a function that will generate aligned coordinates for a given policy area
def generate_coordinates(policy_area):
    # Get years for this policy area
    years = sorted(similarity_matrices[policy_area].keys())
    
    # Get EPGs from the first year
    epgs = list(similarity_matrices[policy_area][years[0]].index)
    
    # Function to perform PCA on a similarity matrix
    def get_coordinates(similarity_matrix):
        # Convert similarity to distance matrix
        distance_matrix = 1 - similarity_matrix
        
        # Convert to numpy array
        X = distance_matrix.values.astype(float)  # Ensure float data type
        
        # Apply PCA
        model = PCA(n_components=2)
        result = model.fit_transform(X)
        
        # Create DataFrame with results - with enhanced metadata
        df_result = pd.DataFrame({
            'x': result[:, 0].astype(float),
            'y': result[:, 1].astype(float),
            'epg': distance_matrix.index.tolist(),
            'color': [get_epg_color(epg) for epg in distance_matrix.index],
            'full_name': [get_epg_full_name(epg) for epg in distance_matrix.index],
            'ideology': [get_epg_ideology(epg) for epg in distance_matrix.index]
        })
        
        return df_result
    
    # Function to align coordinates with reference using Procrustes analysis
    def align_coordinates(target_df, reference_df):
        # Get common EPGs
        common_epgs = set(target_df['epg']).intersection(set(reference_df['epg']))
        
        if len(common_epgs) < 2:
            # Not enough common points to align
            return target_df
        
        # Extract coordinates for common EPGs
        target_coords = np.array([target_df[target_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
        reference_coords = np.array([reference_df[reference_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
        
        # Perform Procrustes analysis to align target to reference
        mtx1, mtx2, disparity = procrustes(reference_coords, target_coords)
        
        # Create transformation matrix (scale, rotation, reflection)
        scale = np.sqrt(np.sum(mtx1[0]**2)) / np.sqrt(np.sum(target_coords[0]**2))
        
        # Apply transformation to all points in target_df
        result_df = target_df.copy()
        coords = result_df[['x', 'y']].values.astype(float)  # Ensure float data type
        
        # Scale and center (simplified Procrustes)
        coords_scaled = coords * scale
        
        # Get centroids
        target_centroid = np.mean(target_coords, axis=0)
        reference_centroid = np.mean(reference_coords, axis=0)
        
        # Translate
        coords_transformed = coords_scaled - target_centroid + reference_centroid
        
        # Update dataframe
        result_df['x'] = coords_transformed[:, 0]
        result_df['y'] = coords_transformed[:, 1]
        
        return result_df
    
    # Compute coordinates for the first year (reference)
    reference_data = get_coordinates(similarity_matrices[policy_area][years[0]])
    
    # Compute and align coordinates for all years
    aligned_data = {}
    for year in years:
        # Compute initial coordinates
        year_data = get_coordinates(similarity_matrices[policy_area][year])
        
        # Align with reference
        if year == years[0]:
            aligned_data[year] = year_data  # Reference year doesn't need alignment
        else:
            aligned_data[year] = align_coordinates(year_data, reference_data)
    
    # Find the overall range of data across all years to set consistent plot boundaries
    all_x = []
    all_y = []
    for year_data in aligned_data.values():
        all_x.extend(year_data['x'].tolist())  # Convert to list
        all_y.extend(year_data['y'].tolist())  # Convert to list
    
    x_min, x_max = min(all_x), max(all_x)
    y_min, y_max = min(all_y), max(all_y)
    
    # Add padding (200% zoom - 50% padding on each side)
    padding_x = (x_max - x_min) * 0.5
    padding_y = (y_max - y_min) * 0.5
    x_range = (float(x_min - padding_x), float(x_max + padding_x))  # Ensure float type
    y_range = (float(y_min - padding_y), float(y_max + padding_y))  # Ensure float type
    
    # Prepare data for JavaScript
    js_data = {}
    for year in years:
        # Convert DataFrame to dictionary safely
        year_dict = {}
        for col in aligned_data[year].columns:
            year_dict[col] = aligned_data[year][col].tolist()  # Convert all values to lists
        js_data[str(year)] = year_dict
    
    return {
        'years': years,
        'data': js_data,
        'init_data': aligned_data[years[0]],
        'x_range': x_range,
        'y_range': y_range
    }

# Generate initial data for the first policy area
initial_policy = policy_areas[0]
initial_data = generate_coordinates(initial_policy)

# Create ColumnDataSource
source = ColumnDataSource(initial_data['init_data'])

# Create the figure with fixed range and hover tool
p = figure(width=800, height=600, 
           title=f'EPG Clustering - {initial_policy} ({initial_data["years"][0]})',
           tools="pan,wheel_zoom,box_zoom,reset,save,hover",
           x_range=Range1d(initial_data['x_range'][0], initial_data['x_range'][1]),
           y_range=Range1d(initial_data['y_range'][0], initial_data['y_range'][1]),
           tooltips=[
               ("Group", "@full_name (@epg)"),
               ("Ideology", "@ideology")
           ])

# Add scatter points
circles = p.circle('x', 'y', source=source, size=15, color='color', alpha=0.8, 
                 line_color='black', line_width=1)

## Add labels
#labels = LabelSet(x='x', y='y', text='epg', source=source,
#                 text_font_size='10pt', text_color='black',
#                 x_offset=10, y_offset=-10)
#p.add_layout(labels)

# Create a legend to explain the colors
legend_items = []
for epg, info in epg_info.items():
    if epg in initial_data['init_data']['epg'].values:
        # Create truly invisible glyphs for the legend by using coordinates outside the plot range
        # This ensures they won't be rendered at all on the actual plot
        invisible_glyph = p.circle(
            x=[float('nan')],  # NaN coordinates won't be plotted
            y=[float('nan')], 
            color=info['color'], 
            size=10
        )
        legend_items.append((f"{epg} - {info['name']}", [invisible_glyph]))

legend = Legend(items=legend_items, location="top_left")
legend.click_policy = "hide"  # Make the legend interactive
p.add_layout(legend)

# Create controls
# Policy area dropdown
policy_select = Select(title="Policy Area:", value=initial_policy, options=policy_areas, width=200)

# Create a slider for years
year_slider = Slider(title="Year", start=0, end=len(initial_data['years'])-1, value=0, step=1, width=400)

# Create play/pause button
play_button = Button(label="▶️ Play", button_type="success", width=100)

# Precompute data for all policy areas
all_data = {}
for policy in policy_areas:
    try:
        print(f"Computing PCA for {policy}...")
        all_data[policy] = generate_coordinates(policy)
    except Exception as e:
        print(f"Error computing PCA for {policy}: {e}")
        # Create empty placeholder
        all_data[policy] = {
            'years': [],
            'data': {},
            'init_data': pd.DataFrame(columns=['x', 'y', 'epg', 'color', 'full_name', 'ideology']),
            'x_range': (-1, 1),
            'y_range': (-1, 1)
        }

# JavaScript callback for policy area selection
policy_callback = CustomJS(args=dict(source=source, p=p, year_slider=year_slider, 
                                  all_data=all_data), code="""
    // Get the selected policy area
    const policy = cb_obj.value;
    
    // Get data for this policy
    const policy_data = all_data[policy];
    const years = policy_data.years;
    
    // Update slider
    year_slider.start = 0;
    year_slider.end = years.length - 1;
    year_slider.value = 0;
    
    // Get first year data
    const year = years[0];
    const year_data = policy_data.data[year];
    
    // Update the plot range
    p.x_range.start = policy_data.x_range[0];
    p.x_range.end = policy_data.x_range[1];
    p.y_range.start = policy_data.y_range[0];
    p.y_range.end = policy_data.y_range[1];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - ${policy} (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# JavaScript callback for slider
slider_callback = CustomJS(args=dict(source=source, p=p, policy_select=policy_select, 
                                 all_data=all_data), code="""
    // Get the selected policy
    const policy = policy_select.value;
    
    // Get data for this policy
    const policy_data = all_data[policy];
    const years = policy_data.years;
    
    // Get the selected year index
    const yearIndex = cb_obj.value;
    const year = years[yearIndex];
    
    // Update data from precomputed results
    const year_data = policy_data.data[year];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - ${policy} (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# Animation callback for play button
animation_callback = CustomJS(args=dict(slider=year_slider, button=play_button), code="""
    if (button.label === "▶️ Play") {
        // Start animation
        button.label = "⏸️ Pause";
        
        // Function to increment slider
        function animate_slider() {
            if (button.label === "⏸️ Pause") {
                let current = slider.value;
                let next = current + 1;
                
                // Loop back to beginning if at the end
                if (next > slider.end) {
                    next = slider.start;
                }
                
                // Update slider value (this will trigger the slider callback)
                slider.value = next;
                
                // Schedule next update
                window.setTimeout(animate_slider, 1000);  // 1 second interval
            }
        }
        
        // Start animation
        animate_slider();
    } else {
        // Pause animation
        button.label = "▶️ Play";
    }
""")

# Connect callbacks
policy_select.js_on_change('value', policy_callback)
year_slider.js_on_change('value', slider_callback)
play_button.js_on_click(animation_callback)

# Create layout
controls = row(
    policy_select,
    year_slider,
    play_button
)

layout = column(
    controls,
    p
)

# Save the visualization to an HTML file
save(layout)

print("Visualization saved to 'epg_clustering_interactive.html'. Open this file in a web browser to interact with the visualization.")

# Find out more about the most distinct simmilarities between EPGs over years and policy areas

In [255]:
import pandas as pd
import numpy as np
from itertools import combinations
from bokeh.models import ColumnDataSource, HTMLTemplateFormatter, TableColumn
from bokeh.models.widgets import DataTable
from bokeh.io import output_file, save
from bokeh.layouts import column

def extract_extreme_similarities(similarity_matrices, top_n=10):
    """
    Extract the top N most and least related EPG pairs across all policy areas and years.
    
    Parameters:
    -----------
    similarity_matrices : dict
        Dictionary with policy areas as keys and a nested dictionary of years -> similarity matrices
    top_n : int
        Number of top/bottom pairs to extract
        
    Returns:
    --------
    dict: Contains 'most_related' and 'least_related' dataframes with the results
    """
    # Lists to store all pair similarities
    all_similarities = []
    
    # Analyze each policy area
    for policy_area, years_data in similarity_matrices.items():
        # Process each year
        for year, sim_matrix in years_data.items():
            # Convert year to string for consistent handling
            year_str = str(year)
            
            # Create a copy and exclude NI group if it exists
            sim_mat = sim_matrix.copy()
            if 'NI' in sim_mat.index:
                sim_mat = sim_mat.drop('NI', axis=0)
            if 'NI' in sim_mat.columns:
                sim_mat = sim_mat.drop('NI', axis=1)
            
            # Skip if matrix is empty after dropping NI
            if sim_mat.empty:
                continue
            
            # Process each EPG pair
            for epg1 in sim_mat.index:
                for epg2 in sim_mat.columns:
                    # Skip diagonal (self-comparisons)
                    if epg1 == epg2:
                        continue
                    
                    # Skip duplicates due to symmetry (only process each pair once)
                    if epg1 > epg2:
                        continue
                    
                    similarity = sim_mat.loc[epg1, epg2]
                    
                    # Skip NaN values
                    if np.isnan(similarity):
                        continue
                    
                    all_similarities.append({
                        'EPG Pair': f"{epg1}-{epg2}",
                        'EPG1': epg1,
                        'EPG2': epg2,
                        'Similarity': similarity,
                        'Year': year_str,
                        'Policy Area': policy_area
                    })
    
    # Convert to DataFrame
    df = pd.DataFrame(all_similarities)
    
    # If no data, return empty results
    if df.empty:
        return {
            'most_related': pd.DataFrame(),
            'least_related': pd.DataFrame()
        }
    
    # Sort by similarity (descending for most related, ascending for least related)
    most_related = df.sort_values('Similarity', ascending=False).head(top_n).copy()
    least_related = df.sort_values('Similarity', ascending=True).head(top_n).copy()
    
    # Format the similarity value
    most_related['Similarity'] = most_related['Similarity'].apply(lambda x: f"{x:.3f}")
    least_related['Similarity'] = least_related['Similarity'].apply(lambda x: f"{x:.3f}")
    
    # Rename columns for clarity
    most_related.rename(columns={'Similarity': 'Most Related'}, inplace=True)
    least_related.rename(columns={'Similarity': 'Least Related'}, inplace=True)
    
    return {
        'most_related': most_related,
        'least_related': least_related
    }

def create_similarity_tables(similarity_matrices, top_n=10):
    """
    Create Bokeh tables displaying the top most and least related EPG pairs.
    
    Parameters:
    -----------
    similarity_matrices : dict
        Dictionary with policy areas as keys and a nested dictionary of years -> similarity matrices
    top_n : int
        Number of top/bottom pairs to display
        
    Returns:
    --------
    tuple: (most_related_table, least_related_table) - Bokeh DataTable objects
    """
    # Extract data
    results = extract_extreme_similarities(similarity_matrices, top_n)
    
    # Format data for display
    for key in ['most_related', 'least_related']:
        # Skip if empty
        if results[key].empty:
            continue
        
        # Add EPG colors
        results[key]['Color1'] = results[key]['EPG1'].apply(
            lambda x: epg_info.get(x, {}).get('color', '#CCCCCC')
        )
        results[key]['Color2'] = results[key]['EPG2'].apply(
            lambda x: epg_info.get(x, {}).get('color', '#CCCCCC')
        )
        
        # Format EPG pair with colors
        results[key]['Formatted EPGs'] = results[key].apply(
            lambda row: f"""
                <div style="display:flex; align-items:center;">
                    <div style="width:15px; height:15px; background-color:{row['Color1']}; 
                         border:1px solid #000; margin-right:5px;"></div>
                    <span style="color:{row['Color1']}; font-weight:bold;">{row['EPG1']}</span>
                    <span style="margin:0 5px;">-</span>
                    <div style="width:15px; height:15px; background-color:{row['Color2']}; 
                         border:1px solid #000; margin-right:5px;"></div>
                    <span style="color:{row['Color2']}; font-weight:bold;">{row['EPG2']}</span>
                </div>
            """, 
            axis=1
        )
    
    # Create Bokeh tables
    tables = {}
    
    for key, title in [
        ('most_related', 'Most Related EPG Pairs'), 
        ('least_related', 'Least Related EPG Pairs')
    ]:
        # Skip if empty
        if results[key].empty:
            tables[key] = None
            continue
        
        # Convert to ColumnDataSource
        source = ColumnDataSource(results[key])
        
        # Create columns with fixed widths
        columns = [
            TableColumn(
                field='Formatted EPGs', 
                title='EPG Pair', 
                formatter=HTMLTemplateFormatter(template='<%= value %>'),
                width=200
            ),
            TableColumn(
                field=key.replace('_', ' ').title(), 
                title='Similarity',
                width=100
            ),
            TableColumn(field='Year', title='Year', width=80),
            TableColumn(field='Policy Area', title='Policy Area', width=200)
        ]
        
        # Calculate exact height needed
        num_rows = len(results[key])
        row_height = 35
        header_height = 30
        exact_height = (num_rows * row_height) + header_height
        
        # Create DataTable
        data_table = DataTable(
            source=source,
            columns=columns,
            width=600,
            height=exact_height,
            index_position=None,
            sizing_mode="fixed",
            autosize_mode="none",
            fit_columns=False,
            min_height=exact_height,
            max_height=exact_height
        )
        
        tables[key] = data_table
    
    return tables

def save_similarity_tables(similarity_matrices, output_prefix="epg_similarity", top_n=10):
    """
    Extract and save tables of most and least related EPG pairs.
    
    Parameters:
    -----------
    similarity_matrices : dict
        Dictionary with policy areas as keys and a nested dictionary of years -> similarity matrices
    output_prefix : str
        Prefix for output filenames
    top_n : int
        Number of top/bottom pairs to display
    """
    # Create tables
    tables = create_similarity_tables(similarity_matrices, top_n)
    
    # Define custom HTML template for proper sizing
    html_template = """
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="utf-8">
        <title>{{ title }}</title>
        <style>
            html, body {
                margin: 0;
                padding: 0;
                height: {{ height }}px;
                overflow: hidden;
            }
            .bk-root {
                height: {{ height }}px !important;
                overflow: hidden !important;
            }
        </style>
        {{ resources }}
        {{ script }}
        {{ div }}
    </head>
    <body>
    </body>
    </html>
    """
    
    # Save each table
    for key, table in tables.items():
        if table is None:
            print(f"No data available for {key} table")
            continue
        
        # Get height for template
        num_rows = len(table.source.data['Year'])
        exact_height = (num_rows * 35) + 30
        
        # Prepare template with correct height
        current_template = html_template.replace("{{ height }}", str(exact_height))
        
        # Save to HTML file
        output_file = f"{output_prefix}_{key}.html"
        output_file(output_file, title=key.replace('_', ' ').title())
        
        # Save with custom template
        from bokeh.resources import CDN
        from bokeh.embed import file_html
        
        html = file_html(
            table, 
            CDN, 
            key.replace('_', ' ').title(),
            template=current_template
        )
        
        with open(output_file, "w") as f:
            f.write(html)
        
        print(f"✅ Saved: {output_file}")
        
        # Generate iframe code
        iframe_html = f"""
        <div style="display: flex; justify-content: center; margin-bottom: 20px;">
          <iframe 
            src="/images/{output_file}"
            style="width: 90vw; max-width: 600px; height: {exact_height}px; border: none; overflow: hidden;"
            loading="lazy" 
            scrolling="no">
          </iframe>
        </div>
        """
        
        print(f"\nUse this iframe code for {key}:")
        print(iframe_html)

# Example usage:
# save_similarity_tables(similarity_matrices, "epg_similarity", 10)

# If you just want the processed data without creating tables:
def get_extreme_similarities_data(similarity_matrices, top_n=10):
    """
    Extract and return the top N most and least related EPG pairs as DataFrames.
    """
    results = extract_extreme_similarities(similarity_matrices, top_n)
    
    # Select and reorder columns for display
    columns = ['EPG Pair', 'Most Related' if 'Most Related' in results['most_related'].columns else 'Similarity', 
               'Year', 'Policy Area']
    
    most_related = results['most_related'][columns].copy() if not results['most_related'].empty else pd.DataFrame()
    
    columns = ['EPG Pair', 'Least Related' if 'Least Related' in results['least_related'].columns else 'Similarity', 
               'Year', 'Policy Area']
    least_related = results['least_related'][columns].copy() if not results['least_related'].empty else pd.DataFrame()
    
    return {
        'most_related': most_related,
        'least_related': least_related
    }

#import pandas as pd
import numpy as np
from itertools import combinations
from bokeh.models import ColumnDataSource, HTMLTemplateFormatter, TableColumn
from bokeh.models.widgets import DataTable
from bokeh.io import output_file, save
from bokeh.layouts import column

def extract_extreme_similarities(similarity_matrices, top_n=10):

    all_similarities = []
    
    # Analyze each policy area
    for policy_area, years_data in similarity_matrices.items():
        # Process each year
        for year, sim_matrix in years_data.items():
            # Convert year to string for consistent handling
            year_str = str(year)
            
            # Create a copy and exclude NI group if it exists
            sim_mat = sim_matrix.copy()
            if 'NI' in sim_mat.index:
                sim_mat = sim_mat.drop('NI', axis=0)
            if 'NI' in sim_mat.columns:
                sim_mat = sim_mat.drop('NI', axis=1)
            
            # Skip if matrix is empty after dropping NI
            if sim_mat.empty:
                continue
            
            # Process each EPG pair
            for epg1 in sim_mat.index:
                for epg2 in sim_mat.columns:
                    # Skip diagonal (self-comparisons)
                    if epg1 == epg2:
                        continue
                    
                    # Skip duplicates due to symmetry (only process each pair once)
                    if epg1 > epg2:
                        continue
                    
                    similarity = sim_mat.loc[epg1, epg2]
                    
                    # Skip NaN values
                    if np.isnan(similarity):
                        continue
                    
                    all_similarities.append({
                        'EPG Pair': f"{epg1}-{epg2}",
                        'EPG1': epg1,
                        'EPG2': epg2,
                        'Similarity': similarity,
                        'Year': year_str,
                        'Policy Area': policy_area
                    })
    
    # Convert to DataFrame
    df = pd.DataFrame(all_similarities)
    
    # If no data, return empty results
    if df.empty:
        return {
            'most_related': pd.DataFrame(),
            'least_related': pd.DataFrame()
        }
    
    # Sort by similarity (descending for most related, ascending for least related)
    most_related = df.sort_values('Similarity', ascending=False).head(top_n).copy()
    least_related = df.sort_values('Similarity', ascending=True).head(top_n).copy()
    
    # Format the similarity value
    most_related['Similarity'] = most_related['Similarity'].apply(lambda x: f"{x:.3f}")
    least_related['Similarity'] = least_related['Similarity'].apply(lambda x: f"{x:.3f}")
    
    # Rename columns for clarity
    most_related.rename(columns={'Similarity': 'Most Related'}, inplace=True)
    least_related.rename(columns={'Similarity': 'Least Related'}, inplace=True)
    
    return {
        'most_related': most_related,
        'least_related': least_related
    }

def create_similarity_tables(similarity_matrices, top_n=10):
    # Extract data
    results = extract_extreme_similarities(similarity_matrices, top_n)
    
    # Format data for display
    for key in ['most_related', 'least_related']:
        # Skip if empty
        if results[key].empty:
            continue
        
        # Add EPG colors
        results[key]['Color1'] = results[key]['EPG1'].apply(
            lambda x: epg_info.get(x, {}).get('color', '#CCCCCC')
        )
        results[key]['Color2'] = results[key]['EPG2'].apply(
            lambda x: epg_info.get(x, {}).get('color', '#CCCCCC')
        )
        
        # Format EPG pair with colors
        results[key]['Formatted EPGs'] = results[key].apply(
            lambda row: f"""
                <div style="display:flex; align-items:center;">
                    <div style="width:15px; height:15px; background-color:{row['Color1']}; 
                         border:1px solid #000; margin-right:5px;"></div>
                    <span style="color:{row['Color1']}; font-weight:bold;">{row['EPG1']}</span>
                    <span style="margin:0 5px;">-</span>
                    <div style="width:15px; height:15px; background-color:{row['Color2']}; 
                         border:1px solid #000; margin-right:5px;"></div>
                    <span style="color:{row['Color2']}; font-weight:bold;">{row['EPG2']}</span>
                </div>
            """, 
            axis=1
        )
    
    # Create Bokeh tables
    tables = {}
    
    for key, title in [
        ('most_related', 'Most Related EPG Pairs'), 
        ('least_related', 'Least Related EPG Pairs')
    ]:
        # Skip if empty
        if results[key].empty:
            tables[key] = None
            continue
        
        # Convert to ColumnDataSource
        source = ColumnDataSource(results[key])
        
        # Create columns with fixed widths
        columns = [
            TableColumn(
                field='Formatted EPGs', 
                title='EPG Pair', 
                formatter=HTMLTemplateFormatter(template='<%= value %>'),
                width=200
            ),
            TableColumn(
                field=key.replace('_', ' ').title(), 
                title='Similarity',
                width=100
            ),
            TableColumn(field='Year', title='Year', width=80),
            TableColumn(field='Policy Area', title='Policy Area', width=200)
        ]
        
        # Calculate exact height needed
        num_rows = len(results[key])
        row_height = 35
        header_height = 30
        exact_height = (num_rows * row_height) + header_height
        
        # Create DataTable
        data_table = DataTable(
            source=source,
            columns=columns,
            width=600,
            height=exact_height,
            index_position=None,
            sizing_mode="fixed",
            autosize_mode="none",
            fit_columns=False,
            min_height=exact_height,
            max_height=exact_height
        )
        
        tables[key] = data_table
    
    return tables

def save_similarity_tables(similarity_matrices, output_prefix="epg_similarity", top_n=10):
    """
    Extract and save tables of most and least related EPG pairs.
    
    Parameters:
    -----------
    similarity_matrices : dict
        Dictionary with policy areas as keys and a nested dictionary of years -> similarity matrices
    output_prefix : str
        Prefix for output filenames
    top_n : int
        Number of top/bottom pairs to display
    """
    # Create tables
    tables = create_similarity_tables(similarity_matrices, top_n)
    
    # Define custom HTML template for proper sizing
    html_template = """
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="utf-8">
        <title>{{ title }}</title>
        <style>
            html, body {
                margin: 0;
                padding: 0;
                height: {{ height }}px;
                overflow: hidden;
            }
            .bk-root {
                height: {{ height }}px !important;
                overflow: hidden !important;
            }
        </style>
        {{ resources }}
        {{ script }}
        {{ div }}
    </head>
    <body>
    </body>
    </html>
    """
    
    # Save each table
    for key, table in tables.items():
        if table is None:
            print(f"No data available for {key} table")
            continue
        
        # Get height for template
        num_rows = len(table.source.data['Year'])
        exact_height = (num_rows * 35) + 30
        
        # Prepare template with correct height
        current_template = html_template.replace("{{ height }}", str(exact_height))
        
        # Save to HTML file
        output_file = f"{output_prefix}_{key}.html"
        output_file(output_file, title=key.replace('_', ' ').title())
        
        # Save with custom template
        from bokeh.resources import CDN
        from bokeh.embed import file_html
        
        html = file_html(
            table, 
            CDN, 
            key.replace('_', ' ').title(),
            template=current_template
        )
        
        with open(output_file, "w") as f:
            f.write(html)
        
        print(f"✅ Saved: {output_file}")
        
        # Generate iframe code
        iframe_html = f"""
        <div style="display: flex; justify-content: center; margin-bottom: 20px;">
          <iframe 
            src="/images/{output_file}"
            style="width: 90vw; max-width: 600px; height: {exact_height}px; border: none; overflow: hidden;"
            loading="lazy" 
            scrolling="no">
          </iframe>
        </div>
        """
        
        print(f"\nUse this iframe code for {key}:")
        print(iframe_html)

# Example usage:
# save_similarity_tables(similarity_matrices, "epg_similarity", 10)

# If you just want the processed data without creating tables:
def get_extreme_similarities_data(similarity_matrices, top_n=10):
    """
    Extract and return the top N most and least related EPG pairs as DataFrames.
    """
    results = extract_extreme_similarities(similarity_matrices, top_n)
    
    # Select and reorder columns for display
    columns = ['EPG Pair', 'Most Related' if 'Most Related' in results['most_related'].columns else 'Similarity', 
               'Year', 'Policy Area']
    
    most_related = results['most_related'][columns].copy() if not results['most_related'].empty else pd.DataFrame()
    
    columns = ['EPG Pair', 'Least Related' if 'Least Related' in results['least_related'].columns else 'Similarity', 
               'Year', 'Policy Area']
    least_related = results['least_related'][columns].copy() if not results['least_related'].empty else pd.DataFrame()
    
    return {
        'most_related': most_related,
        'least_related': least_related
    }

new_dict = {k: v for k, v in similarity_matrices.items() if k != 'internal regulations of the ep'}

data = get_extreme_similarities_data(new_dict, 10)
print(data['most_related'])
print(data['least_related'])
data = get_extreme_similarities_data(new_dict, 10)
print(data['most_related'])
print(data['least_related'])

            EPG Pair Most Related  Year           Policy Area
4401  Greens/EFA-REG        1.000  2013             petitions
4022         EPP-REG        0.999  2022   international trade
4026  Greens/EFA-REG        0.999  2022   international trade
4020  EPP-Greens/EFA        0.998  2022   international trade
4407         REG-S&D        0.998  2013             petitions
4402  Greens/EFA-S&D        0.998  2013             petitions
4792  Greens/EFA-S&D        0.998  2021  regional development
1162  Greens/EFA-S&D        0.997  2010     culture education
4807  Greens/EFA-S&D        0.997  2022  regional development
3052  Greens/EFA-S&D        0.996  2011       gender equality
            EPG Pair Least Related  Year  \
2349         IDG-REG         0.000  2018   
2057         EPP-REG         0.000  2017   
3794    S&D-The Left         0.000  2006   
505          IDG-S&D         0.001  2020   
846   Greens/EFA-REG         0.001  2007   
828          EPP-S&D         0.001  2006   
2148      

## Create yes vote percentage matrix over EPGs and policy area for each year

In [202]:
# Step 1: Filter votes with codes 1, 2, 3
df_relevant_votes = df[df['vote'].isin([1, 2, 3])]

# Step 2: Get EPGs present in all years
years = df_relevant_votes['year'].unique()
epgs_by_year = [set(df_relevant_votes[df_relevant_votes['year'] == year]['epg'].dropna().unique()) for year in years]
common_epgs = set.intersection(*epgs_by_year)
epgs = sorted(list(common_epgs))

# Step 3: Get policy areas present in all years
policy_areas_by_year = [set(df_relevant_votes[df_relevant_votes['year'] == year]['policy_area'].dropna().unique()) for year in years]
common_policy_areas = set.intersection(*policy_areas_by_year)
policy_areas = sorted(list(common_policy_areas))

# Step 4: Filter the DataFrame using both EPG and Policy Area
df_filtered = df_relevant_votes[
    (df_relevant_votes['epg'].isin(common_epgs)) &
    (df_relevant_votes['policy_area'].isin(common_policy_areas))
]

# Step 5: Group by year and calculate % of 'yes' (vote==1) per (EPG, policy_area)
df_year = df_filtered.groupby('year')

yes_percentage_matrices = {}

for year, group in df_year:
    yes_percentage_matrix = pd.DataFrame(index=['combined'] + epgs, columns=['combined'] + policy_areas, dtype=float)

    # Combined EPG (i.e. 'combined' row): average for all MEPs by policy area
    for policy_area in policy_areas:
        subset = group[group['policy_area'] == policy_area]
        votes_series = subset['vote'].value_counts()
        yes_votes = votes_series.get(1, 0)
        total_votes = votes_series.sum()
        percentage = yes_votes / total_votes if total_votes > 0 else np.nan
        yes_percentage_matrix.loc['combined', policy_area] = percentage

    # Combined EPG + Combined Policy Area (bottom-right cell)
    subset = group
    votes_series = subset['vote'].value_counts()
    yes_votes = votes_series.get(1, 0)
    total_votes = votes_series.sum()
    percentage = yes_votes / total_votes if total_votes > 0 else np.nan
    yes_percentage_matrix.loc['combined', 'combined'] = percentage

    # For each EPG
    for epg in epgs:
        for policy_area in policy_areas:
            subset = group[(group['epg'] == epg) & (group['policy_area'] == policy_area)]
            votes_series = subset['vote'].value_counts()
            yes_votes = votes_series.get(1, 0)
            total_votes = votes_series.sum()
            percentage = yes_votes / total_votes if total_votes > 0 else np.nan
            yes_percentage_matrix.loc[epg, policy_area] = percentage

        # EPG row's 'combined' column
        subset = group[group['epg'] == epg]
        votes_series = subset['vote'].value_counts()
        yes_votes = votes_series.get(1, 0)
        total_votes = votes_series.sum()
        percentage = yes_votes / total_votes if total_votes > 0 else np.nan
        yes_percentage_matrix.loc[epg, 'combined'] = percentage

    yes_percentage_matrices[year] = yes_percentage_matrix


# Heatmap with EPG vs Policy Area over time

In [248]:
import pandas as pd
import numpy as np
from bokeh.plotting import figure, output_file, save
from bokeh.layouts import column, row
from bokeh.models import (ColumnDataSource, LinearColorMapper, ColorBar, 
                          HoverTool, Slider, CheckboxGroup, CustomJS, 
                          BasicTicker, PrintfTickFormatter, Button, 
                          ColumnDataSource, Toggle)
from bokeh.palettes import RdBu11
from bokeh.io import output_notebook, show

# Create a combined matrix for all years
# First, get all the votes and total counts for each cell
total_yes_counts = {}
total_vote_counts = {}

# Initialize the matrix with the same structure as the year matrices
first_year = list(yes_percentage_matrices.keys())[0]
combined_matrix = pd.DataFrame(index=yes_percentage_matrices[first_year].index,
                               columns=yes_percentage_matrices[first_year].columns,
                               dtype=float)

# For each year, extract the raw vote counts to get accurate aggregations
for year, df_group in df_year:
    for epg in epgs + ['combined']:
        for policy_area in policy_areas + ['combined']:
            # Key for tracking totals
            cell_key = (epg, policy_area)
            
            # Get the appropriate subset of data
            if epg == 'combined' and policy_area == 'combined':
                subset = df_group
            elif epg == 'combined':
                subset = df_group[df_group['policy_area'] == policy_area]
            elif policy_area == 'combined':
                subset = df_group[df_group['epg'] == epg]
            else:
                subset = df_group[(df_group['epg'] == epg) & (df_group['policy_area'] == policy_area)]
            
            # Count votes
            votes_series = subset['vote'].value_counts()
            yes_votes = votes_series.get(1, 0)
            total_votes = votes_series.sum()
            
            # Add to running totals
            if cell_key not in total_yes_counts:
                total_yes_counts[cell_key] = 0
                total_vote_counts[cell_key] = 0
            
            total_yes_counts[cell_key] += yes_votes
            total_vote_counts[cell_key] += total_votes

# Calculate percentages for the combined matrix
for epg in epgs + ['combined']:
    for policy_area in policy_areas + ['combined']:
        cell_key = (epg, policy_area)
        if total_vote_counts[cell_key] > 0:
            combined_matrix.loc[epg, policy_area] = total_yes_counts[cell_key] / total_vote_counts[cell_key]
        else:
            combined_matrix.loc[epg, policy_area] = np.nan

# Create a separate total matrix without 'combined' row and column
total_matrix = combined_matrix.copy()
if 'combined' in total_matrix.index:
    total_matrix = total_matrix.drop('combined', axis=0)
if 'combined' in total_matrix.columns:
    total_matrix = total_matrix.drop('combined', axis=1)

# Add the matrices to our dictionary
yes_percentage_matrices['all_years'] = combined_matrix
yes_percentage_matrices['total'] = total_matrix


In [249]:
print(common_epgs)

{'NI', 'The Left', 'S&D', 'IDG', 'Greens/EFA', 'REG', 'EPP'}


## Create PCA, tranform with procrustes and animate over time for paramter policy area

## Aggregated over all policy areas

## Interactive verison with selecte clustering and policy area

In [246]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from scipy.spatial import procrustes
from bokeh.plotting import figure, save, output_file
from bokeh.models import (ColumnDataSource, LabelSet, Slider, CustomJS, Button, 
                         Range1d, Select, Legend, LegendItem, HoverTool)
from bokeh.layouts import column, row

# Set output to an HTML file
output_file("plots/epg_clustering_interactive.html")

# Function to get color for an EPG (using the predefined colors or default to gray)
def get_epg_color(epg):
    if epg in epg_info:
        return epg_info[epg]['color']
    return '#CCCCCC'  # Default gray for unknown EPGs

# Function to get full name for an EPG
def get_epg_full_name(epg):
    if epg in epg_info:
        return epg_info[epg]['name']
    return epg  # Default to the abbreviation if not found

# Function to get ideology for an EPG
def get_epg_ideology(epg):
    if epg in epg_info:
        return epg_info[epg]['ideology']
    return 'Unknown'  # Default to Unknown if not found

# Filter to only include specific policy areas
filtered_policy_areas = [
    'budgetary control', 
    'agriculture', 
    'culture education', 
    'development', 
    'employment social affairs', 
    'environment public health', 
    'fisheries', 
    'gender equality', 
    'international trade', 
    'legal affairs', 
    'regional development'
]

# Get available policy areas that exist in the data
policy_areas = [area for area in filtered_policy_areas if area in similarity_matrices]

# Create a function that will generate aligned coordinates for a given policy area
def generate_coordinates(policy_area):
    # Get years for this policy area
    years = sorted(similarity_matrices[policy_area].keys())
    
    # Get EPGs from the first year
    epgs = list(similarity_matrices[policy_area][years[0]].index)
    
    # Function to perform PCA on a similarity matrix
    def get_coordinates(similarity_matrix):
        # Convert similarity to distance matrix
        distance_matrix = 1 - similarity_matrix
        
        # Convert to numpy array
        X = distance_matrix.values.astype(float)  # Ensure float data type
        
        # Apply PCA
        model = PCA(n_components=2)
        result = model.fit_transform(X)
        
        # Create DataFrame with results - with enhanced metadata
        df_result = pd.DataFrame({
            'x': result[:, 0].astype(float),
            'y': result[:, 1].astype(float),
            'epg': distance_matrix.index.tolist(),
            'color': [get_epg_color(epg) for epg in distance_matrix.index],
            'full_name': [get_epg_full_name(epg) for epg in distance_matrix.index],
            'ideology': [get_epg_ideology(epg) for epg in distance_matrix.index]
        })
        
        return df_result
    
    # Function to align coordinates with reference using Procrustes analysis
    def align_coordinates(target_df, reference_df):
        # Get common EPGs
        common_epgs = set(target_df['epg']).intersection(set(reference_df['epg']))
        
        if len(common_epgs) < 2:
            # Not enough common points to align
            return target_df
        
        # Extract coordinates for common EPGs
        target_coords = np.array([target_df[target_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
        reference_coords = np.array([reference_df[reference_df['epg'] == epg][['x', 'y']].values[0] for epg in common_epgs])
        
        # Perform Procrustes analysis to align target to reference
        mtx1, mtx2, disparity = procrustes(reference_coords, target_coords)
        
        # Create transformation matrix (scale, rotation, reflection)
        scale = np.sqrt(np.sum(mtx1[0]**2)) / np.sqrt(np.sum(target_coords[0]**2))
        
        # Apply transformation to all points in target_df
        result_df = target_df.copy()
        coords = result_df[['x', 'y']].values.astype(float)  # Ensure float data type
        
        # Scale and center (simplified Procrustes)
        coords_scaled = coords * scale
        
        # Get centroids
        target_centroid = np.mean(target_coords, axis=0)
        reference_centroid = np.mean(reference_coords, axis=0)
        
        # Translate
        coords_transformed = coords_scaled - target_centroid + reference_centroid
        
        # Update dataframe
        result_df['x'] = coords_transformed[:, 0]
        result_df['y'] = coords_transformed[:, 1]
        
        return result_df
    
    # Compute coordinates for the first year (reference)
    reference_data = get_coordinates(similarity_matrices[policy_area][years[0]])
    
    # Compute and align coordinates for all years
    aligned_data = {}
    for year in years:
        # Compute initial coordinates
        year_data = get_coordinates(similarity_matrices[policy_area][year])
        
        # Align with reference
        if year == years[0]:
            aligned_data[year] = year_data  # Reference year doesn't need alignment
        else:
            aligned_data[year] = align_coordinates(year_data, reference_data)
    
    # Find the overall range of data across all years to set consistent plot boundaries
    all_x = []
    all_y = []
    for year_data in aligned_data.values():
        all_x.extend(year_data['x'].tolist())  # Convert to list
        all_y.extend(year_data['y'].tolist())  # Convert to list
    
    x_min, x_max = min(all_x), max(all_x)
    y_min, y_max = min(all_y), max(all_y)
    
    # Add padding (200% zoom - 50% padding on each side)
    padding_x = (x_max - x_min) * 0.5
    padding_y = (y_max - y_min) * 0.5
    x_range = (float(x_min - padding_x), float(x_max + padding_x))  # Ensure float type
    y_range = (float(y_min - padding_y), float(y_max + padding_y))  # Ensure float type
    
    # Prepare data for JavaScript
    js_data = {}
    for year in years:
        # Convert DataFrame to dictionary safely
        year_dict = {}
        for col in aligned_data[year].columns:
            year_dict[col] = aligned_data[year][col].tolist()  # Convert all values to lists
        js_data[str(year)] = year_dict
    
    return {
        'years': years,
        'data': js_data,
        'init_data': aligned_data[years[0]],
        'x_range': x_range,
        'y_range': y_range
    }

# Generate initial data for the first policy area
initial_policy = policy_areas[0]
initial_data = generate_coordinates(initial_policy)

# Create ColumnDataSource
source = ColumnDataSource(initial_data['init_data'])

# Create the figure with fixed range and hover tool
p = figure(width=800, height=600, 
           title=f'EPG Clustering - {initial_policy} ({initial_data["years"][0]})',
           tools="pan,wheel_zoom,box_zoom,reset,save,hover",
           x_range=Range1d(initial_data['x_range'][0], initial_data['x_range'][1]),
           y_range=Range1d(initial_data['y_range'][0], initial_data['y_range'][1]),
           tooltips=[
               ("Group", "@full_name (@epg)"),
               ("Ideology", "@ideology")
           ])

# Add scatter points
circles = p.circle('x', 'y', source=source, size=15, color='color', alpha=0.8, 
                 line_color='black', line_width=1)

## Add labels
#labels = LabelSet(x='x', y='y', text='epg', source=source,
#                 text_font_size='10pt', text_color='black',
#                 x_offset=10, y_offset=-10)
#p.add_layout(labels)

# Create a legend to explain the colors
legend_items = []
for epg, info in epg_info.items():
    if epg in initial_data['init_data']['epg'].values:
        # Create truly invisible glyphs for the legend by using coordinates outside the plot range
        # This ensures they won't be rendered at all on the actual plot
        invisible_glyph = p.circle(
            x=[float('nan')],  # NaN coordinates won't be plotted
            y=[float('nan')], 
            color=info['color'], 
            size=10
        )
        legend_items.append((f"{epg} - {info['name']}", [invisible_glyph]))

legend = Legend(items=legend_items, location="top_left")
legend.click_policy = "hide"  # Make the legend interactive
p.add_layout(legend)

# Create controls
# Policy area dropdown
policy_select = Select(title="Policy Area:", value=initial_policy, options=policy_areas, width=200)

# Create a slider for years
year_slider = Slider(title="Year", start=0, end=len(initial_data['years'])-1, value=0, step=1, width=400)

# Create play/pause button
play_button = Button(label="▶️ Play", button_type="success", width=100)

# Precompute data for all policy areas
all_data = {}
for policy in policy_areas:
    try:
        print(f"Computing PCA for {policy}...")
        all_data[policy] = generate_coordinates(policy)
    except Exception as e:
        print(f"Error computing PCA for {policy}: {e}")
        # Create empty placeholder
        all_data[policy] = {
            'years': [],
            'data': {},
            'init_data': pd.DataFrame(columns=['x', 'y', 'epg', 'color', 'full_name', 'ideology']),
            'x_range': (-1, 1),
            'y_range': (-1, 1)
        }

# JavaScript callback for policy area selection
policy_callback = CustomJS(args=dict(source=source, p=p, year_slider=year_slider, 
                                  all_data=all_data), code="""
    // Get the selected policy area
    const policy = cb_obj.value;
    
    // Get data for this policy
    const policy_data = all_data[policy];
    const years = policy_data.years;
    
    // Update slider
    year_slider.start = 0;
    year_slider.end = years.length - 1;
    year_slider.value = 0;
    
    // Get first year data
    const year = years[0];
    const year_data = policy_data.data[year];
    
    // Update the plot range
    p.x_range.start = policy_data.x_range[0];
    p.x_range.end = policy_data.x_range[1];
    p.y_range.start = policy_data.y_range[0];
    p.y_range.end = policy_data.y_range[1];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - ${policy} (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# JavaScript callback for slider
slider_callback = CustomJS(args=dict(source=source, p=p, policy_select=policy_select, 
                                 all_data=all_data), code="""
    // Get the selected policy
    const policy = policy_select.value;
    
    // Get data for this policy
    const policy_data = all_data[policy];
    const years = policy_data.years;
    
    // Get the selected year index
    const yearIndex = cb_obj.value;
    const year = years[yearIndex];
    
    // Update data from precomputed results
    const year_data = policy_data.data[year];
    
    // Update the source data
    source.data['x'] = year_data['x'];
    source.data['y'] = year_data['y'];
    source.data['epg'] = year_data['epg'];
    source.data['color'] = year_data['color'];
    source.data['full_name'] = year_data['full_name'];
    source.data['ideology'] = year_data['ideology'];
    
    // Update title
    p.title.text = `EPG Clustering - ${policy} (${year})`;
    
    // Trigger update
    source.change.emit();
""")

# Animation callback for play button
animation_callback = CustomJS(args=dict(slider=year_slider, button=play_button), code="""
    if (button.label === "▶️ Play") {
        // Start animation
        button.label = "⏸️ Pause";
        
        // Function to increment slider
        function animate_slider() {
            if (button.label === "⏸️ Pause") {
                let current = slider.value;
                let next = current + 1;
                
                // Loop back to beginning if at the end
                if (next > slider.end) {
                    next = slider.start;
                }
                
                // Update slider value (this will trigger the slider callback)
                slider.value = next;
                
                // Schedule next update
                window.setTimeout(animate_slider, 1000);  // 1 second interval
            }
        }
        
        // Start animation
        animate_slider();
    } else {
        // Pause animation
        button.label = "▶️ Play";
    }
""")

# Connect callbacks
policy_select.js_on_change('value', policy_callback)
year_slider.js_on_change('value', slider_callback)
play_button.js_on_click(animation_callback)

# Create layout
controls = row(
    policy_select,
    year_slider,
    play_button
)

layout = column(
    controls,
    p
)

# Save the visualization to an HTML file
save(layout)

print("Visualization saved to 'epg_clustering_interactive.html'. Open this file in a web browser to interact with the visualization.")


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.


'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=

Computing PCA for budgetary control...
Computing PCA for agriculture...
Computing PCA for culture education...
Computing PCA for development...
Computing PCA for employment social affairs...
Computing PCA for environment public health...
Computing PCA for fisheries...
Computing PCA for gender equality...
Computing PCA for international trade...
Computing PCA for legal affairs...
Computing PCA for regional development...
Visualization saved to 'epg_clustering_interactive.html'. Open this file in a web browser to interact with the visualization.
