
---

# Motivation

### What is your dataset?

We selected a comprehensive dataset of European Parliament voting records. This dataset captures how each Member of the European Parliament (MEP) voted on proposed legislation, along with detailed information about the legislation itself, the MEPs, and their party affiliations. The dataset includes two main file types:

* **RCV (Roll Call Votes):** Contains individual vote records for each MEP, including their country, national party, and European Parliamentary Group (EPG).
* **Voted Docs:** Provides metadata on the legislation being voted on, including the vote date, outcome, and associated policy area.

The data is divided across four parliamentary sessions:

* EP6 (2004–2009)
* EP7 (2009–2014)
* EP8 (2014–2019)
* EP9 (2019–2022)

To enable longitudinal analysis, we merged data from all four sessions. This required extensive preprocessing to reconcile differences in schema and formatting across sessions.

### Why did you choose this particular dataset?

We chose this dataset because it offers a rich foundation for analyzing trends in European politics over time. It enables us to explore whether broader political shifts—such as increasing polarization or the rise of right-leaning ideologies—are observable in parliamentary voting behavior.

### What was your goal for the end user's experience?

Our goal is to provide users with an accessible interface to explore aggregated voting patterns and to present clear, data-driven insights into evolving political dynamics within the European Parliament. We aim to make it easy for users to identify trends related to polarization, such as which parties or policy areas are becoming more divisive, and which remain broadly supported. Ultimately, we want to support informed reflection on how shifts in ideology are shaping legislative decision-making at the EU level.

---

# Basic Stats

### Preprocessing

To analyze voting trends in the European Parliament across four sessions (EP6–EP9), we consolidated roll-call vote data (`RCV`) with metadata on each legislative item (`Voted Docs`). Due to inconsistencies across sessions in schema, date formats, and naming conventions, several preprocessing steps were necessary:

* **Name Standardization:** MEP names were cleaned to ensure consistency across sessions, accounting for formatting, punctuation, and Unicode differences.
* **Text Normalization:** Policy area fields and party/EPG names were cleaned to remove noise (e.g., punctuation, inconsistent capitalization).
* **Date Parsing:** Dates appeared mainly in `dd.mm.yyyy` and `yyyy-mm-dd` formats, with occasional `dd/mm/yyyy` values and trailing `00:00:00` time components. A custom parser converts all of them to a uniform `datetime` format.
* **Schema Harmonization:** Since column names and vote identifiers varied by session (e.g., `euro_act_id` in EP6 vs `Vote ID` in later sessions), we created session-aware mappings to unify data.
* **Party Group Mapping:** EP political group (`EPG`) names were mapped to their common abbreviations (e.g., “Group of the European People’s Party...” → `EPP`) to allow for consistent comparison.
* **Missing Values:** Non-informative entries (e.g., empty vote fields or unknown policy areas) were filtered or imputed based on context.

These steps allowed us to generate a unified dataset spanning 2004–2022 with over **4.8 million individual votes**.
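
As a rough illustration of the schema harmonization step, the session-aware mapping can be written as a lookup table. This is a minimal sketch, not the pipeline used in this notebook: the source column names are the ones referenced in the processing code, while the unified target names (`vote_id`, `policy_area`, ...) and the helper `harmonize_voted_docs` are assumptions made for the example.

```python
import pandas as pd

# Source column names per session (from the processing code in this notebook);
# the unified names on the right are illustrative assumptions.
SESSION_COLUMNS = {
    "EP6": {"euro_act_id": "vote_id", "date": "date", "title": "title",
            "main_policy_name": "policy_area"},
    "EP7": {"Vote ID": "vote_id", "Date": "date", "Title": "title",
            "De": "policy_area"},
    "EP8": {"Vote ID": "vote_id", "Date": "date", "Title": "title",
            "De/Policy area": "policy_area"},
    "EP9": {"Vote ID": "vote_id", "Date": "date", "Title": "title",
            "Policy area": "policy_area"},
}


def harmonize_voted_docs(voted_docs: pd.DataFrame, ep_session: str) -> pd.DataFrame:
    """Select and rename the metadata columns of one session (hypothetical helper)."""
    mapping = SESSION_COLUMNS[ep_session]
    out = voted_docs[list(mapping)].rename(columns=mapping)
    # Vote identifiers are compared as strings later on
    out["vote_id"] = out["vote_id"].astype(str)
    out["ep_session"] = ep_session
    return out
```

Keeping the mapping in one dictionary makes it easy to see at a glance which column differs between sessions.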

### Dataset Statistics

Key observations from the cleaned data:

* 📅 **Time Coverage:** The dataset spans four European Parliament terms:

  * EP6: 2004–2009
  * EP7: 2009–2014
  * EP8: 2014–2019
  * EP9: 2019–2022

* 🧑‍🤝‍🧑 **MEPs Involved:** \~1,300 unique MEPs from all EU member states represented during 2004–2022.

* 📄 **Votes Recorded:**

  * \~26,000 roll-call votes
  * \~4.8 million MEP-level voting entries (rows)

* 🗳️ **Vote Distribution (Sample):**

  | Vote      | Meaning                    | % of Total     |
  | --------- | -------------------------- | -------------- |
  | 1         | For                        | \~48%          |
  | 2         | Against                    | \~27%          |
  | 3         | Abstain                    | \~12%          |
  | 4 / 5 / 6 | Absent / No vote / Excused | \~13% combined |

* 🏛️ **Top Policy Areas:** After cleaning and consolidating, common topics included:

  * Environment
  * Budget and financial affairs
  * Foreign relations
  * Digital and industry regulation
  * Civil rights and rule of law

* 🧭 **EP Group Participation:** Most votes came from major groups such as `EPP`, `S&D`, `Renew` (labeled `REG` in the cleaned data), `ID`, `Greens/EFA`, and `The Left`.
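
A vote distribution like the one in the table above could be computed along these lines. This is a sketch on the cleaned data's `vote` column; the one-to-one assignment of codes 4, 5 and 6 to "Absent", "No vote" and "Excused" is an assumption read off the table's ordering, and `vote_distribution` is a hypothetical helper.

```python
import pandas as pd

# Assumed meaning of the vote codes (order for 4/5/6 inferred from the table)
VOTE_LABELS = {1: "For", 2: "Against", 3: "Abstain",
               4: "Absent", 5: "No vote", 6: "Excused"}


def vote_distribution(votes: pd.Series) -> pd.DataFrame:
    """Return the percentage share of each vote meaning."""
    shares = (
        votes.map(VOTE_LABELS)
        .value_counts(normalize=True)  # fractions that sum to 1
        .mul(100)
        .round(1)
    )
    return shares.rename_axis("meaning").reset_index(name="percent")
```

With the combined frame loaded, `vote_distribution(df['vote'])` would give one row per vote meaning.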

---


In [1]:
import pandas as pd

import unicodedata

def clean_name(first_name, last_name):
    """Return a normalized 'First Last' name: ASCII-only, punctuation-free, capitalized."""

    def normalize(part):
        # Treat None as empty and coerce other non-strings to str
        if not isinstance(part, str):
            part = str(part) if part is not None else ""
        part = part.lower().strip()
        # Strip diacritics: decompose, then drop the non-ASCII code points
        part = unicodedata.normalize('NFKD', part).encode('ASCII', 'ignore').decode('ASCII')
        # Replace punctuation with spaces
        for char in ['-', "'", "`", ".", ",", "&"]:
            part = part.replace(char, ' ')
        # split() collapses repeated whitespace
        return ' '.join(word.capitalize() for word in part.split())

    return f"{normalize(first_name)} {normalize(last_name)}".strip()

def clean_text(text):
    if not isinstance(text, str):
        return text
  
    text = text.lower()
    
    for char in ['&', ',', '-']:
        text = text.replace(char, ' ')
    
    text = text.replace(' and ', ' ')
    
    while '  ' in text:
        text = text.replace('  ', ' ')
    
    return text.strip()    

def process_ep_voting_data(rcv_files, voted_docs_files):

    if len(rcv_files) != len(voted_docs_files):
        raise ValueError("The lists of RCV files and Voted docs files must have the same length")
     
    all_data = []
    
    for i, (rcv_file, voted_doc_file) in enumerate(zip(rcv_files, voted_docs_files)):
        print(f"Processing files {i+1}/{len(rcv_files)}: {rcv_file} and {voted_doc_file}")
        
        if "EP6" in rcv_file:
            ep_session = "EP6"
            vote_start_index = 10
            rcv_data = pd.read_excel(rcv_file, header=1)
        elif "EP7" in rcv_file:
            ep_session = "EP7"
            vote_start_index = 9
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        elif "EP8" in rcv_file:
            ep_session = "EP8"
            vote_start_index = 9
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        elif "EP9" in rcv_file:
            ep_session = "EP9"
            vote_start_index = 10
            rcv_data = pd.read_excel(rcv_file, sheet_name=0)
        else:
            # vote_start_index is undefined for unrecognized sessions, so fail early
            raise ValueError(f"Cannot infer EP session from file name: {rcv_file}")

        rcv_data = rcv_data.dropna(how='all')
        
        voted_docs = pd.read_excel(voted_doc_file)


        # Get vote columns headers (index)
        vote_columns = rcv_data.columns[vote_start_index:].tolist()
       
        votes_df = process_votes_ep(rcv_data, voted_docs, vote_columns, ep_session=ep_session)

        print(f"Should be around: {len(rcv_data) * len(voted_docs)}")
        print(f"Got: {len(votes_df)}")      

        # Add EP session information
        votes_df['ep_session'] = ep_session
        
        # Append to the list of results
        all_data.append(votes_df)
    
    # Concatenate all dataframes
    combined_df = pd.concat(all_data, ignore_index=True)
    
    # Final cleaning happens separately in clean_combined_data()
    return combined_df


def process_votes_ep(rcv_data, voted_docs, vote_columns, ep_session = None):
    """Process voting data for one session (EP6, EP7, EP8 or EP9) into long format"""

    total_skipped = 0

    if ep_session == 'EP6':
        date = 'date'
        title = 'title'
        policy_area = 'main_policy_name'
        vote_id_key = 'euro_act_id'
        author = 'author_name'

        mep_id_key = 'WebisteEpID'

    else:
        date = 'Date'
        title = 'Title'
        policy_area = 'De'
        vote_id_key = 'Vote ID'
        author = 'Author'

        mep_id_key = 'WebisteEpID'

        if ep_session == 'EP7':
            mep_id_key = 'MEP ID'

        if ep_session == 'EP8':
            policy_area = "De/Policy area"

        elif ep_session == 'EP9':
            policy_area = 'Policy area'
  
    
    # Create a dictionary to map vote IDs to vote information
    vote_info = {}
    for _, row in voted_docs.iterrows():

        vote_info[str(row[vote_id_key])] = {
            'date': row[date],
            'title': row[title],
            'policy_area': row[policy_area],
            'author': row.get(author),  # look up the value, not the column name
        }
    
    # Create a list to store results
    results = []
    
    # Process each MEP's votes
    for _, mep_row in rcv_data.iterrows():
        country = mep_row['Country']
        party = mep_row['Party']
        epg = mep_row['EPG']

        first_name = mep_row['Fname']
        last_name = mep_row['Lname']
        
        mep_id = mep_row[mep_id_key]
    
        # Process each vote for this MEP
        for raw_col in vote_columns:
            
            # Column labels may be ints in the sheet, while vote_info keys are strings
            vote_col = str(raw_col)
            vote_code = f'{ep_session}-{vote_col}' 
            
            if vote_col not in vote_info:
                total_skipped += 1
                continue
            
            # Index with the original label so int and str headers both work
            mep_vote = mep_row[raw_col]
            
            if mep_vote == 0:
                continue
                
            info = vote_info[vote_col]
            
            results.append({
                'full name': clean_name(first_name, last_name),
                'country': country,
                'national_party': party,
                'epg': epg,
                'mep_id': mep_id,
                'vote_code': vote_code,
                'vote': mep_vote,
                'date': info['date'],
                'title': info['title'],
                'policy_area': clean_text(info['policy_area']),
            })
    
    print(f"Were not able to match: {total_skipped} votes")
    return pd.DataFrame(results)
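

The nested `iterrows` loops above touch every MEP/vote cell in Python, which is slow for tens of millions of cells. A vectorized alternative is to reshape the wide sheet with `DataFrame.melt` and join the metadata afterwards. The following is a minimal sketch under the assumption that the vote columns and metadata key behave as in the loop above; `melt_votes` and its parameters are hypothetical names, not part of the original pipeline.

```python
import pandas as pd


def melt_votes(rcv_data, voted_docs, id_cols, vote_columns, vote_id_key):
    """Reshape wide roll-call data to one row per (MEP, vote) and attach metadata."""
    long_df = rcv_data.melt(
        id_vars=id_cols, value_vars=vote_columns,
        var_name="vote_id", value_name="vote",
    )
    # Drop code 0, mirroring the `mep_vote == 0` skip in the loop above
    long_df = long_df[long_df["vote"] != 0]
    # Column labels may be ints; cast to str so they match the metadata keys
    long_df = long_df.assign(vote_id=long_df["vote_id"].astype(str))
    docs = voted_docs.assign(vote_id=voted_docs[vote_id_key].astype(str))
    # Inner join: votes without metadata are dropped, like the skipped votes above
    return long_df.merge(docs, on="vote_id", how="inner")
```

Name cleaning and the session prefix for `vote_code` would still need to be applied to the result.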


In [20]:
import pandas as pd
import numpy as np

def parse_mixed_dates(date_str):
    
    if not isinstance(date_str, str):
        date_str = str(date_str)
    
    date_str = date_str.strip()
    
    try:
        # Check for dates with time component
        if " 00:00:00" in date_str:
            date_str = date_str.replace(" 00:00:00", "")

        if date_str == '18 ian 2007':
            return pd.to_datetime('2007-01-18')
        
        if '.' in date_str:
            # Parse as dd.mm.yyyy
            return pd.to_datetime(date_str, format='%d.%m.%Y')
        elif '-' in date_str:
            # Parse as yyyy-mm-dd
            return pd.to_datetime(date_str, format='%Y-%m-%d')
        elif '/' in date_str:
            try:
                # First try %d/%m/%Y (day/month/4-digit year)
                return pd.to_datetime(date_str, format='%d/%m/%Y')
            except ValueError:
                try:
                    # Then try %d/%m/%y (day/month/2-digit year)
                    return pd.to_datetime(date_str, format='%d/%m/%y')
                except ValueError:
                    # Fall back to pandas default parser with dayfirst=True
                    return pd.to_datetime(date_str, dayfirst=True)
        else:
            # For formats we haven't explicitly handled, use pandas' flexible parser
            return pd.to_datetime(date_str)
            
    except Exception as e:
        print(f"Error parsing date '{date_str}': {e}")
        return pd.NaT  # In case of error
    

def clean_combined_data(df):
    
    df['date'] = df['date'].astype(str).apply(parse_mixed_dates)
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    
    df['policy_area_cleaned'] = df['policy_area'].str.strip().str.lower()

    # Map full EPG names to short labels:
    epg_mapping = {
        "Group of the European People's Party (Christian Democrats)": 'EPP',
        "Group of the European People's Party (Christian Democrats) and European Democrats": 'EPP',
        'Socialist Group in the European Parliament': 'S&D',
        'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament': 'S&D',
        'Confederal Group of the European United Left - Nordic Green Left': 'The Left',
        'Group of the Greens/European Free Alliance': 'Greens/EFA',
        'Independence/Democracy Group': 'IDG',
        'Europe of freedom and democracy Group': 'IDG',
        'Europe of Freedom and Direct Democracy Group': 'IDG',
        'Europe of Nations and Freedom Group': 'ID',
        'European Conservatives and Reformists Group': 'ECR',
        'Non-attached Members': 'NI',
        'Group of the Alliance of Liberals and Democrats for Europe' : 'REG'
    }
    df['epg'] = df['epg'].replace(epg_mapping)

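    # Merge a misspelled policy area into its correct label. Both columns are fixed,
    # because 'policy_area_cleaned' was derived before this correction.
    typo_fix = {'regioanal development': 'regional development'}
    df['policy_area'] = df['policy_area'].replace(typo_fix)
    df['policy_area_cleaned'] = df['policy_area_cleaned'].replace(typo_fix)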
    
    return df
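

Applying `parse_mixed_dates` row by row is easy to follow but slow on millions of rows. A column-wise variant can try each explicit format in turn and keep the first successful parse. This is a sketch only: `parse_dates_vectorized` is a hypothetical helper, it covers the explicit formats handled above, and one-off values such as `'18 ian 2007'` would come back as `NaT` and need separate handling.

```python
import pandas as pd


def parse_dates_vectorized(dates: pd.Series) -> pd.Series:
    """Parse a column of mixed-format date strings, trying one format at a time."""
    text = (
        dates.astype(str)
        .str.strip()
        .str.replace(" 00:00:00", "", regex=False)  # drop midnight time suffix
    )
    formats = ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y", "%d/%m/%y")
    # errors="coerce" turns non-matching strings into NaT instead of raising
    parsed = pd.to_datetime(text, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Only fill the positions that no earlier format could parse
        parsed = parsed.fillna(pd.to_datetime(text, format=fmt, errors="coerce"))
    return parsed
```

Counting the remaining `NaT` values afterwards (`parsed.isna().sum()`) shows how many dates still need manual attention.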



In [3]:

voted_docs_files = ["VoteWatch-EP-voting-data_2004-2022/EP6_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP7_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP8_Voted docs.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP9_Voted docs.xlsx"]
rcv_files = ["VoteWatch-EP-voting-data_2004-2022/EP6_RCVs_2022_06_13.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP7_RCVs_2014_06_19.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP8_RCVs_2019_06_25.xlsx", "VoteWatch-EP-voting-data_2004-2022/EP9_RCVs_2022_06_22.xlsx"]

df = process_ep_voting_data(rcv_files, voted_docs_files)

print('Saving uncleaned data')

# Save the combined dataframe
output_file = "ep_voting_data_combined_raw.csv"
df.to_csv(output_file, index=False)

print('Cleaning data')
df = clean_combined_data(df)

print('Saving cleaned data')

output_file = "ep_voting_data_combined_clean.csv"
df.to_csv(output_file, index=False)


Processing files 1/4: VoteWatch-EP-voting-data_2004-2022/EP6_RCVs_2022_06_13.xlsx and VoteWatch-EP-voting-data_2004-2022/EP6_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 5827060
Got: 4759840
Processing files 2/4: VoteWatch-EP-voting-data_2004-2022/EP7_RCVs_2014_06_19.xlsx and VoteWatch-EP-voting-data_2004-2022/EP7_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 5937733
Got: 5233859
Processing files 3/4: VoteWatch-EP-voting-data_2004-2022/EP8_RCVs_2019_06_25.xlsx and VoteWatch-EP-voting-data_2004-2022/EP8_Voted docs.xlsx
Were not able to match: 0 votes
Should be around: 8796216
Got: 7696506
Processing files 4/4: VoteWatch-EP-voting-data_2004-2022/EP9_RCVs_2022_06_22.xlsx and VoteWatch-EP-voting-data_2004-2022/EP9_Voted docs.xlsx




Were not able to match: 0 votes
Should be around: 10915249
Got: 9520348
Saving uncleaned data
Cleaning data


In [4]:
# Get all column headers as a list
headers_list = df.columns.tolist()
print("Column headers:")
print(headers_list)

Column headers:
['full name', 'country', 'national_party', 'epg', 'mep_id', 'vote_code', 'vote', 'date', 'title', 'policy_area', 'ep_session', 'year', 'month', 'policy_area_cleaned']


In [None]:
import pandas as pd

try:
    df
except NameError:
    # Load the raw combined file written by the processing cell above
    df = pd.read_csv('ep_voting_data_combined_raw.csv', dtype=str)


In [21]:
df_cleaned = clean_combined_data(df)



In [22]:
df = df_cleaned
output_file = "ep_voting_data_combined_clean.csv"
df.to_csv(output_file, index=False)

In [52]:
headers_list = list(df.columns)
print(headers_list)

['full name', 'country', 'national_party', 'epg', 'mep_id', 'vote_code', 'vote', 'date', 'title', 'policy_area', 'ep_session', 'year', 'month', 'policy_area_cleaned']


In [53]:
# Step 1: Collect the EPGs present in each year (rows whose date failed to parse have no year)
years = sorted(df['year'].dropna().unique())
epgs_by_year = [set(df[df['year'] == year]['epg'].dropna().unique()) for year in years]

print(epgs_by_year)
# Step 2: Find common EPGs
common_epgs = set.intersection(*epgs_by_year)

# Step 3: Print the result
print(f"EPGs present in all {len(years)} years:")
print(sorted(common_epgs))

[{'Union for Europe of the Nations Group', 'S&D', 'The Left', 'NI', 'REG', 'IDG', 'EPP', 'Greens/EFA'}, {'Union for Europe of the Nations Group', 'S&D', 'The Left', 'NI', 'REG', 'IDG', 'EPP', 'Greens/EFA'}, {'Union for Europe of the Nations Group', 'S&D', 'The Left', 'NI', 'REG', 'IDG', 'EPP', 'Greens/EFA'}, {'Union for Europe of the Nations Group', 'S&D', 'The Left', 'NI', 'REG', 'IDG', 'EPP', 'Greens/EFA'}, {'Union for Europe of the Nations Group', 'S&D', 'The Left', 'NI', 'REG', 'IDG', 'EPP', 'Greens/EFA'}, {'Union for Europe of the Nations Group', 'S&D', 'ECR', 'The Left', 'NI', 'REG', 'IDG', 'EPP', 'Greens/EFA'}, {'S&D', 'ECR', 'The Left', 'REG', 'NI', 'IDG', 'EPP', 'Greens/EFA'}, {'S&D', 'ECR', 'The Left', 'REG', 'NI', 'IDG', 'EPP', 'Greens/EFA'}, {'S&D', 'ECR', 'The Left', 'REG', 'NI', 'IDG', 'EPP', 'Greens/EFA'}, {'S&D', 'ECR', 'The Left', 'REG', 'NI', 'IDG', 'EPP', 'Greens/EFA'}, {'S&D', 'ECR', 'The Left', 'ID', 'REG', 'NI', 'nan', 'IDG', 'EPP', 'Greens/EFA'}, {'S&D', 'ECR', '

In [None]:
from itertools import combinations

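def rice_index(yes, no, abstain):
    """Rice index |yes - no| / (yes + no + abstain); NaN when no votes were cast."""
    total = yes + no + abstain
    if total == 0:
        return float('nan')
    return abs(yes - no) / total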

df['epg'] = df['epg'].astype(str)
df['year'] = df['year'].astype(float).astype(int)


df_year = df.groupby(['year', 'policy_area'])
epgs = sorted(list(common_epgs))

similarity_matrices = {}

for name, group in df_year:
    year, policy_area = name 
    year = int(year)

    sim_matrix = pd.DataFrame(index=epgs, columns=epgs)

    # Calculate similarities between all EPG pairs
    for epg1, epg2 in combinations(epgs, 2):
        ep1_votes = group[group['epg'] == epg1]['vote'].value_counts()
        ep2_votes = group[group['epg'] == epg2]['vote'].value_counts()

        # Use 0 for vote types a group never cast; .get() without a default returns None
        yes = ep1_votes.get(1, 0) + ep2_votes.get(1, 0)
        no = ep1_votes.get(2, 0) + ep2_votes.get(2, 0)
        abst = ep1_votes.get(3, 0) + ep2_votes.get(3, 0)

        # Skip pairs without any yes/no/abstain votes to avoid dividing by zero
        if yes + no + abst == 0:
            continue

        similarity = rice_index(yes, no, abst)


        sim_matrix.loc[epg1, epg2] = similarity
        sim_matrix.loc[epg2, epg1] = similarity  # Matrix is symmetric

    # Set diagonal to 1 (self-similarity)
    for epg in epgs:
        sim_matrix.loc[epg, epg] = 1.0
            
    # Store the matrix
    matrix_key = (year, policy_area)
    similarity_matrices[matrix_key] = sim_matrix.fillna(0)
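

The cell above pools the votes of two groups before applying the index. For comparison, here is a small sketch of the same formula applied to each group on its own, which measures how united a single group votes. The helper `group_cohesion` and the toy data are assumptions for illustration; the formula is the one from `rice_index` above, with abstentions in the denominator.

```python
import pandas as pd


def rice_index(yes: int, no: int, abstain: int) -> float:
    """|yes - no| / (yes + no + abstain); NaN when no votes were cast."""
    total = yes + no + abstain
    return abs(yes - no) / total if total else float("nan")


def group_cohesion(votes: pd.DataFrame) -> pd.Series:
    """Rice index per EPG from a long table with 'epg' and 'vote' columns."""
    counts = (
        votes[votes["vote"].isin([1, 2, 3])]
        .groupby(["epg", "vote"]).size()
        .unstack(fill_value=0)
        # Make sure all three vote codes exist as columns
        .reindex(columns=[1, 2, 3], fill_value=0)
    )
    return counts.apply(lambda r: rice_index(r[1], r[2], r[3]), axis=1)


# Toy data: group A votes 3 for / 1 against, group B is evenly split
toy = pd.DataFrame({"epg": ["A"] * 4 + ["B"] * 4,
                    "vote": [1, 1, 1, 2, 1, 2, 3, 3]})
cohesion = group_cohesion(toy)
```

A value near 1 means the group votes as a bloc, and a value near 0 means it is evenly split between for and against.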
    



In [51]:
from itertools import combinations
import numpy as np

def agreement_index(votes1, votes2):
    """
    Calculate agreement between two voting groups based on vote percentage distributions.
    
    Parameters:
    - votes1, votes2: Series of vote counts indexed by vote type (1=for, 2=against, 3=abstention, etc.)
    
    Returns:
    - Similarity score between 0 and 1, based on overlap of percentage distributions,
      or the sentinel string '1' / '2' when the first / second group cast no votes
    """
    # Only for/against/abstain votes (codes 1, 2, 3) enter the comparison.
    # Since 1 == 1.0 in Python, this set matches both int and float vote codes.
    all_vote_types = set(votes1.index) | set(votes2.index)
    relevant_types = all_vote_types & {1, 2, 3}
    
    # Calculate total votes for each group
    total_votes_1 = sum(votes1.get(vote_type, 0) for vote_type in relevant_types)
    total_votes_2 = sum(votes2.get(vote_type, 0) for vote_type in relevant_types)
    
    # If either group has no votes, there's no way to compare percentages
    if total_votes_1 == 0:
        return '1'
    if total_votes_2 == 0:
        return '2'
    
    total_agreement = 0
    
    # For each vote type, calculate the proportion of agreement based on percentages
    for vote_type in relevant_types:
        # Calculate percentages
        pct1 = votes1.get(vote_type, 0) / total_votes_1
        pct2 = votes2.get(vote_type, 0) / total_votes_2
        
        # Add minimum percentage as the agreement for this vote type
        total_agreement += min(pct1, pct2)
    
    # The sum of all minimum percentages directly represents the overlap
    # between the two distributions (ranges from 0 to 1)
    return total_agreement


# Modified code to use the new similarity function
df['epg'] = df['epg'].astype(str)
df['year'] = df['year'].astype(float).astype(int)
df_year = df.groupby(['year', 'policy_area'])
epgs = sorted(list(common_epgs))
similarity_matrices = {}

for name, group in df_year:
    year, policy_area = name 
    year = int(year)
    sim_matrix = pd.DataFrame(index=epgs, columns=epgs)
    
    # Calculate similarities between all EPG pairs
    for epg1, epg2 in combinations(epgs, 2):
        ep1_votes = group[group['epg'] == epg1]['vote'].value_counts()
        ep2_votes = group[group['epg'] == epg2]['vote'].value_counts()

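        # agreement_index takes exactly two arguments
        similarity = agreement_index(ep1_votes, ep2_votes)
        # It returns '1' or '2' when that group cast no for/against/abstain votes;
        # report the case and store NaN instead of the sentinel string
        if similarity == '1':
            print(f'Total votes is zero for:\nEPG: {epg1}\nYear: {year}\nPA: {policy_area}')
            similarity = np.nan
        elif similarity == '2':
            print(f'Total votes is zero for:\nEPG: {epg2}\nYear: {year}\nPA: {policy_area}')
            similarity = np.nan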
            

        sim_matrix.loc[epg1, epg2] = similarity
        sim_matrix.loc[epg2, epg1] = similarity  # Matrix is symmetric
    
    # Set diagonal to 1 (self-similarity)
    for epg in epgs:
        sim_matrix.loc[epg, epg] = 1.0
            
    # Store the matrix
    if policy_area not in similarity_matrices:
        similarity_matrices[policy_area] = {}
    similarity_matrices[policy_area][year] = sim_matrix


Total votes is zero


KeyboardInterrupt: 
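

The overlap score used in `agreement_index` has a simple interpretation: the sum of element-wise minima of two probability distributions equals one minus their total variation distance. The sketch below checks this on made-up vote counts; `overlap` and `total_variation` are hypothetical helpers written for this illustration.

```python
import pandas as pd


def _normalized_pair(p: pd.Series, q: pd.Series):
    """Turn two count vectors into aligned probability distributions."""
    p, q = p / p.sum(), q / q.sum()
    # Vote types missing from one group get probability 0
    return p.align(q, fill_value=0)


def overlap(p: pd.Series, q: pd.Series) -> float:
    """Sum of element-wise minima of the two vote-share distributions."""
    p, q = _normalized_pair(p, q)
    return float(pd.concat([p, q], axis=1).min(axis=1).sum())


def total_variation(p: pd.Series, q: pd.Series) -> float:
    """Half the L1 distance between the two vote-share distributions."""
    p, q = _normalized_pair(p, q)
    return float((p - q).abs().sum() / 2)


# Made-up counts indexed by vote code (1 = for, 2 = against, 3 = abstain)
group_a = pd.Series({1: 80, 2: 10, 3: 10})
group_b = pd.Series({1: 20, 2: 70, 3: 10})
score = overlap(group_a, group_b)
distance = total_variation(group_a, group_b)
```

Here the shares are (0.8, 0.1, 0.1) and (0.2, 0.7, 0.1), so the overlap is 0.2 + 0.1 + 0.1 = 0.4 and the total variation distance is 0.6. Note that this compares aggregate vote shares, so two groups can score high even if their members disagree on individual votes.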

In [39]:
print(similarity_matrices['budget'][2004].keys())

Index(['EPP', 'Greens/EFA', 'IDG', 'NI', 'REG', 'S&D', 'The Left'], dtype='object')


In [50]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from termcolor import colored
import pandas as pd

def plot_similarity_matrices(similarity_matrices, output_dir=None, colormap='YlGnBu', cmap_reverse=False):
    """
    Plot similarity matrices for each policy area and year.
    
    Parameters:
    - similarity_matrices: Nested dictionary {policy_area: {year: similarity_matrix}}
    - output_dir: Directory to save visualizations (if None, won't save files)
    - colormap: Matplotlib colormap name
    - cmap_reverse: Whether to reverse the colormap
    
    Returns:
    - None, displays matrices in terminal and creates visual plots
    """
    # Create colormap
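    # plt.cm.get_cmap was removed in recent Matplotlib releases; plt.get_cmap is still available.
    # Reversed colormaps are registered under the "_r" suffix.
    cmap = plt.get_cmap(f"{colormap}_r" if cmap_reverse else colormap)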
    
    # Process each policy area and year
    for policy_area in sorted(similarity_matrices.keys()):
        print(f"\n{'=' * 80}")
        print(f"POLICY AREA: {policy_area}")
        print(f"{'=' * 80}")
        
        for year in sorted(similarity_matrices[policy_area].keys()):
            print(f"\n{'-' * 80}")
            print(f"Year: {year}")
            print(f"{'-' * 80}")
            
            # Get the similarity matrix
            sim_matrix = similarity_matrices[policy_area][year]
            
            # Print matrix in terminal with color coding
            print_matrix_in_terminal(sim_matrix)
            
            # Create and display/save visual plot
            create_heatmap(sim_matrix, policy_area, year, cmap, output_dir)

def print_matrix_in_terminal(matrix):
    """
    Print a similarity matrix in the terminal with color coding.
    """
    # Get matrix dimensions and elements
    epgs = matrix.index
    max_epg_len = max(len(str(epg)) for epg in epgs) + 2
    
    # Print header
    header = ' ' * max_epg_len
    for epg in epgs:
        header += f"{str(epg):<{max_epg_len}}"
    print(header)
    
    # Print each row with color
    for epg1 in epgs:
        row_str = f"{str(epg1):<{max_epg_len}}"
        for epg2 in epgs:
            value = matrix.loc[epg1, epg2]
            # Color coding based on similarity value
            if value >= 0.8:
                color = 'green'
            elif value >= 0.6:
                color = 'cyan'
            elif value >= 0.4:
                color = 'blue'
            elif value >= 0.2:
                color = 'yellow'
            else:
                color = 'red'
                
            # Pad before coloring: the ANSI escape codes would otherwise count toward the width
            colored_value = colored(f"{value:<{max_epg_len}.2f}", color)
            row_str += colored_value
        print(row_str)
    print("\n")

def create_heatmap(matrix, policy_area, year, cmap, output_dir=None):
    """
    Create and save/display a heatmap visualization of the similarity matrix.
    """
    plt.figure(figsize=(10, 8))
    
    # Create heatmap
    sns.heatmap(matrix, annot=True, cmap=cmap, vmin=0, vmax=1, 
                linewidths=.5, fmt='.2f', cbar_kws={'label': 'Similarity'})
    
    # Add labels and title
    plt.title(f'Voting Similarity Matrix - {policy_area} - Year {year}')
    plt.xlabel('European Parliamentary Groups')
    plt.ylabel('European Parliamentary Groups')
    plt.tight_layout()
    
    # Save if output directory is provided
    if output_dir:
        filename = f"{output_dir}/similarity_matrix_{policy_area.replace(' ', '_')}_{year}.png"
        plt.savefig(filename, dpi=300, bbox_inches='tight')
        print(f"Saved visualization to {filename}")
    
    # Display
    plt.show()
    plt.close()

# Example usage:
print(similarity_matrices.keys())
print_matrix_in_terminal(similarity_matrices['budget'][2015])
#plot_similarity_matrices(similarity_matrices, output_dir="./visualizations")

dict_keys(['agriculture', 'budget', 'civil liberties justice home affairs', 'constitutional inter institutional affairs', 'development', 'economics', 'environment public health', 'fisheries', 'foreign security policy', 'internal regulations of the ep', 'juridical affairs', 'petitions', 'transport tourism', 'culture education', 'employment social affairs', 'gender equality', 'industry research energy', 'internal market consumer protection', 'international trade', 'regional development', 'budgetary control', 'economic monetary affairs', 'legal affairs'])
            EPP         Greens/EFA  IDG         NI          REG         S&D         The Left    
EPP         1.00        0.00        0.00        0.00        0.00        0.00        0.00
Greens/EFA  0.00        1.00        0.00        0.00        0.00        0.00        0.00
IDG         0.00        0.00        1.00        0.00        0.00        0.00        0.00
NI          0.00        0.