## Our four starting questions, which we narrowed down to a single one


1. How well can the level of corruption in a European country be quantified?

This question is too broad and lacks specificity. It is also not a good question to answer within a notebook.
The meaning of "quantified" needs clarification: are we looking at an index, a model, or a metric?

2. Are there different forms of corruption prevalent in different European countries?

While interesting, it may be challenging to obtain granular, comparable data across countries.
Requires deep qualitative insights, which may not be fully captured by available datasets.

3. What characteristics of a country predict the level of corruption?

Why it's promising:

- Allows for quantitative analysis using regression or classification models.
- Can leverage socio-economic, political, and governance indicators.
- Provides actionable insights for policymakers and organizations.
- Well-defined and measurable through publicly available datasets.

Potential Challenges:

- Ensuring data quality and avoiding biases in reporting.
- Distinguishing correlation from causation.

4. What characteristics of a country predict an increase or decrease in the level of corruption?

Why it's promising:

- Focuses on change over time, enabling trend analysis.
- Useful for policy evaluation and forecasting.
- Encourages a deeper exploration of temporal datasets (e.g., economic reforms, governance improvements, etc.).
- Can help identify early warning signs for rising corruption.

Potential Challenges:

- Requires time-series data and careful handling of lag effects.
- External factors (global economic crises, political events) may introduce noise.


The following question seems the most promising, so we decided to pursue only this one:

# What characteristics of a European country predict the level of corruption?

Why it's promising:

- Allows for quantitative analysis using regression or classification models.
- Can leverage socio-economic, political, and governance indicators.
- Provides actionable insights for policymakers and organizations.
- Well-defined and measurable through publicly available datasets.

Potential Challenges:

- Ensuring data quality and avoiding biases in reporting.
- Distinguishing correlation from causation.

Maybe we could tackle this one as well, but let's find out later: Which country characteristics (e.g., economic, political, social indicators) best predict the level of corruption in European countries, and how do these characteristics relate to any gap between actual and perceived corruption?



# Start

Since we decided to focus on Europe, the first challenge is to create a dataset of all European countries. To keep this notebook from getting too long, we provide the finished data as **europe_countries**. It lists all European countries with their ISO3 and ISO2 codes, which should make preprocessing easier later on.

In [465]:
import numpy as np
import pandas as pd
import os

In [447]:
countries = pd.read_csv("../data/processed/europe_countries.csv")

In [448]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    49 non-null     object
 1   ISO3 Code  49 non-null     object
 2   ISO2 Code  49 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


In [449]:
countries['Country'].unique()

array(['Albania', 'Andorra', 'Austria', 'Belarus', 'Belgium',
       'Bosnia & Herzegovina', 'Bulgaria', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France',
       'Georgia', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland',
       'Italy', 'Kosovo2', 'Kosovo', 'Latvia', 'Liechtenstein',
       'Lithuania', 'Luxembourg', 'Malta', 'Moldova', 'Monaco',
       'Montenegro', 'Netherlands', 'North Macedonia', 'Norway', 'Poland',
       'Portugal', 'Romania', 'Russia', 'San Marino', 'Serbia',
       'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey',
       'Ukraine', 'United Kingdom', 'Vatican City'], dtype=object)

Counts of European countries vary depending on the source. Our list has 49 entries covering 48 countries, since Kosovo appears twice (as `Kosovo2` and `Kosovo`), and it includes the countries most promising for our analysis. As we will see later, most of the other datasets do not provide data for all of these countries anyway, especially for small ones such as Vatican City or Monaco.
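The list above contains both `Kosovo2` and `Kosovo`, so a quick duplicate check on the names may be worthwhile. The following is a minimal sketch on a hypothetical four-row excerpt; which ISO3 code belongs to which of the two Kosovo rows is an assumption made only for illustration.

```python
import pandas as pd

# Hypothetical excerpt of the country list; the real file has 49 rows.
# Which code belongs to which Kosovo row is assumed here.
sample = pd.DataFrame({
    "Country": ["Italy", "Kosovo2", "Kosovo", "Latvia"],
    "ISO3 Code": ["ITA", "RKS", "XKX", "LVA"],
})

# Strip trailing digits (such as the "2" in "Kosovo2") and flag every
# name that then occurs more than once.
clean_name = sample["Country"].str.replace(r"\d+$", "", regex=True).str.strip()
duplicated_names = sorted(clean_name[clean_name.duplicated(keep=False)].unique())
print(duplicated_names)
```

If the check reports a name, one of the rows could be dropped or the two codes could be mapped to a single one before merging.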

### Corruption Perceptions Index (CPI) from Transparency International
This dataset shows the perceived corruption of countries and ranks them.

Link to data: https://images.transparencycdn.org/images/CPI2023_FullDataSet.zip
https://www.transparency.org/en/news/how-cpi-scores-are-calculated

 0   Economy ISO3                                
 1   Economy Name                                 
 2   Year                                           
 3   Corruption Perceptions Index Rank            
 4   Corruption Perceptions Index Score           
 5   Corruption Perceptions Index Sources        
 6   Corruption Perceptions Index Standard Error

We took the xlsx and transformed it into this more compact version containing only the features we need and only European countries. The data points are explained at the end of each processing step.

In [450]:
cpi_data = pd.read_csv("../data/processed/CPI.csv")

In [451]:
iso3_europe_full = set(countries["ISO3 Code"])

iso3_europe_cpi = set(cpi_data["Economy ISO3"])

print('Length of all europe countries in our list: ',len(iso3_europe_full))
print('Length of cpi countries:', len(iso3_europe_cpi))

print('Countries missing')

iso3_europe_full-iso3_europe_cpi

Length of all europe countries in our list:  49
Length of cpi countries: 42
Countries missing


{'AND', 'LIE', 'MCO', 'RKS', 'SMR', 'VAT', 'XKX'}

Countries missing: Andorra (`AND`), Liechtenstein (`LIE`), Monaco (`MCO`), Kosovo (`RKS` and `XKX`), San Marino (`SMR`), and Vatican City (`VAT`). We could neglect these countries due to their small size and impact.
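To avoid translating the missing codes into names by hand, the set difference can be looked up in the reference list. This is a minimal sketch on hypothetical excerpts; `countries_sample` and `cpi_codes` are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical excerpts of the reference list and of the CPI codes.
countries_sample = pd.DataFrame({
    "Country": ["Andorra", "Austria", "Monaco", "Norway"],
    "ISO3 Code": ["AND", "AUT", "MCO", "NOR"],
})
cpi_codes = {"AUT", "NOR"}

# Codes in the reference list that the CPI data does not cover
missing_codes = set(countries_sample["ISO3 Code"]) - cpi_codes

# Look up the country names that belong to those codes
is_missing = countries_sample["ISO3 Code"].isin(missing_codes)
missing_names = countries_sample.loc[is_missing, "Country"].sort_values().tolist()
print(missing_names)
```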

In [452]:
cpi_data['Year'].unique()

array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,
       2023])

In [453]:
cpi_data.rename(columns={'Economy ISO3': 'ISO3 Code'}, inplace=True)

In [454]:
cpi_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 7 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   ISO3 Code                                    504 non-null    object 
 1   Economy Name                                 504 non-null    object 
 2   Year                                         504 non-null    int64  
 3   Corruption Perceptions Index Rank            491 non-null    float64
 4   Corruption Perceptions Index Score           504 non-null    float64
 5   Corruption Perceptions Index Sources         504 non-null    float64
 6   Corruption Perceptions Index Standard Error  504 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 27.7+ KB


#### CPI Dataset Summary
* Corruption Perceptions Index Rank (float64):
This represents the ranking of the country based on its Corruption Perceptions Index (CPI) score, with rank 1 being the least corrupt country.
Lower ranks indicate better perceived transparency.

* Corruption Perceptions Index Score (float64):
This score quantifies perceived corruption levels on a scale of 0 to 100, where:
0 = highly corrupt
100 = very clean (low corruption perception)
It is calculated using multiple data sources, standardized, and aggregated.

* Corruption Perceptions Index Sources (float64):
The number of sources used to calculate the CPI score for a given country.
The CPI uses multiple expert assessments and business surveys; having more sources increases reliability.

*  Corruption Perceptions Index Standard Error (float64):
This represents the uncertainty or variability in the CPI score, showing how much variation exists among different data sources.
A lower standard error indicates a more reliable score, while a higher one suggests more disagreement or variability in corruption perceptions across sources.
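As a rough illustration of how the standard error can be read, a score and its standard error can be turned into an approximate 90% interval. The sketch below assumes a normal approximation and uses hypothetical numbers that are not taken from the CPI file.

```python
# Hypothetical values, not taken from the CPI file.
score = 70.0
standard_error = 2.0

# 1.645 is the two-sided 90% quantile of the standard normal distribution;
# using it here is an assumption (normal approximation).
z_90 = 1.645
lower = score - z_90 * standard_error
upper = score + z_90 * standard_error
print(f"approximate 90% interval: {lower:.2f} to {upper:.2f}")
```

Two countries whose intervals overlap strongly should probably not be treated as clearly different in perceived corruption.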

### GDP Data

We decided to go for GDP per Capita (PPP - Purchasing Power Parity)

Why?

- Adjusted for cost of living, making it a fairer comparison across countries.
- Helps analyze the standard of living and economic well-being in relation to corruption.
- Frequently used in corruption-related studies to measure economic development.

Definition: A country's gross domestic product (GDP) at purchasing power parity (PPP) per capita is the PPP value of all final goods and services produced within an economy in a given year, divided by the average (or mid-year) population for the same year. This is similar to nominal GDP per capita but adjusted for the cost of living in each country.
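The definition boils down to one division. A small worked example with purely hypothetical figures:

```python
# Hypothetical figures for illustration only.
gdp_ppp = 500_000_000_000          # GDP in international dollars (PPP)
mid_year_population = 10_000_000   # average (mid-year) population

# GDP per capita (PPP) = PPP-adjusted GDP / mid-year population
gdp_per_capita_ppp = gdp_ppp / mid_year_population
print(gdp_per_capita_ppp)
```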



### Processing the World Bank indicators and forming one single wpi dataset

### Function to filter the datasets on our European country list

In [480]:
def get_countries(df, countries_df, iso="ISO3 Code"):
    """Keep only the rows of df whose ISO code appears in countries_df."""
    try:
        # Rename on a copy so the caller's DataFrame is not modified
        df = df.rename(columns={'Country Code': 'ISO3 Code'})

        if iso not in countries_df.columns:
            raise KeyError(f"Column '{iso}' not found in countries DataFrame")

        # Process the ISO codes and clean whitespace
        iso_europe = set(countries_df[iso].dropna().astype(str).str.strip())
        iso_data = set(df[iso].dropna().astype(str).str.strip())

        # Print the codes that occur in the data but not in our European
        # list (these rows are dropped by the filter below)
        missing_countries = iso_data - iso_europe
        if missing_countries:
            print("Missing countries:", missing_countries)

        # Filter the DataFrame based on valid ISO codes
        filtered_df = df[df[iso].isin(iso_europe)]

        return filtered_df

    except KeyError as err:
        print(f"KeyError: {err}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
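One detail worth checking is that the function strips whitespace when building the code sets but filters on the raw column. The following sketch, on hypothetical data, normalizes the codes once and uses the same cleaned values for both the filter and the diagnostics; it is an alternative illustration, not the function used in this notebook.

```python
import pandas as pd

# Hypothetical data: note the stray whitespace in one code.
data = pd.DataFrame({
    "ISO3 Code": ["AUT", " NOR", "KAZ"],
    "Year": ["2012", "2012", "2012"],
    "value": [1.8, 2.0, -0.7],
})
europe_codes = {"AUT", "NOR"}

# Normalize the codes first so the membership test and the printed
# diagnostics are based on the same cleaned values.
codes = data["ISO3 Code"].astype(str).str.strip()
filtered = data[codes.isin(europe_codes)]
dropped = sorted(set(codes) - europe_codes)
print(len(filtered), dropped)
```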


### Conversion of the raw CSV to a compact DataFrame

In [467]:
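def convert_worldbank_df(df, value_name):
    """Reshape a wide World Bank export (one column per year) into long format."""
    # The series identifiers are not needed after loading, so drop them
    df = df.drop(labels=["Series Code", "Series Name"], axis=1)

    #df = df.drop(df.tail(8).index)

    # Remove the bracketed suffix from the column headers
    df.columns = df.columns.str.replace(r"\[.*?\]", "", regex=True).str.strip()

    # One row per country and year instead of one column per year
    df = df.melt(
        id_vars=['Country Name', 'ISO3 Code'],
        var_name='Year',
        value_name=value_name
    )

    df.reset_index(drop=True, inplace=True)

    return df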
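To make the reshaping step concrete, here is a minimal sketch of the same wide-to-long conversion on a hypothetical two-country extract. The column headers, series labels, and values are assumptions chosen to resemble a World Bank export; they are not read from the real files.

```python
import pandas as pd

# Hypothetical two-row extract in a wide layout (one column per year).
wide = pd.DataFrame({
    "Country Name": ["Austria", "Norway"],
    "ISO3 Code": ["AUT", "NOR"],
    "Series Name": ["Rule of Law: Estimate"] * 2,
    "Series Code": ["RL.EST"] * 2,
    "2012 [YR2012]": [1.8, 2.0],
    "2013 [YR2013]": [1.8, 1.9],
})

long_df = wide.drop(columns=["Series Code", "Series Name"])
# Strip the bracketed suffix: "2012 [YR2012]" becomes "2012"
long_df.columns = long_df.columns.str.replace(r"\[.*?\]", "", regex=True).str.strip()
# One row per country and year
long_df = long_df.melt(
    id_vars=["Country Name", "ISO3 Code"],
    var_name="Year",
    value_name="Rule_of_law",
)
print(long_df)
```

Note that `Year` is a string after melting, because it comes from the column headers.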

In [457]:
iso_europe = set(countries["ISO3 Code"].dropna().astype(str).str.strip())
#iso_data = set(rule_of_law["Country Code"].dropna().astype(str).str.strip())

### Loop to create a unified dataset

Since the process is the same for every World Bank dataset, we created a loop and combined the end results to keep the code easier to follow. You will find an overall description of every data point after the code.

In [474]:
# Define your datasets info
datasets_info = {
    "rule_of_law": {
        "file_path": "rule_of_law",
        "value_name": "Rule_of_law"
    },
    "government_effectiveness": {
        "file_path": "government_effectiveness",
        "value_name": "Gov_effectiveness"
    },
    "control_of_corruption": {
        "file_path": "control_of_corruption",
        "value_name": "Control_of_corruption"
    },
    "fdi": {
        "file_path": "fdi",
        "value_name": "Fdi"
    },
    "gdp_per_capita": {
        "file_path": "gdp_per_capita",
        "value_name": "Gdp_per_capita"
    },
    "gini": {
        "file_path": "gini",
        "value_name": "Gini"
    },
    "unemployement": {
        "file_path": "unemployement",
        "value_name": "Unemployement"
    },
    "political_stability": {
        "file_path": "political_stability",
        "value_name": "Political_stability"
    },
}


In [475]:
def process_dataset(file_path, value_name, countries_df=countries):
    """
    Reads a CSV file and processes it into a standardized format:
      - Reads the CSV
      - Converts it using 'convert_worldbank_df' (user-defined)
      - Matches and filters countries using 'get_countries' (user-defined)
      - Keeps only [ISO3 Code, Year, <value_name>]

    Parameters:
        file_path (str): Path to the CSV file (without the .csv extension).
        value_name (str): Name to be assigned to the indicator value column.
        countries_df (pd.DataFrame): A DataFrame containing valid country information
                                     for filtering/mapping.

    Returns:
        pd.DataFrame: Processed DataFrame with columns:
                      ['ISO3 Code', 'Year', <value_name>]
    """
    full_path = f'../data/raw/{file_path}.csv'
    
    # Check if file exists
    if not os.path.exists(full_path):
        raise FileNotFoundError(f"File not found: {full_path}")
    
    # Read CSV
    df = pd.read_csv(full_path)
    
    df.rename(columns={'Country Code': 'ISO3 Code'}, inplace=True)

        
    # Convert the World Bank style data to a long format with 'Year' and 'value_name'
    df = convert_worldbank_df(df=df, value_name=value_name)
    
    # Filter/match country codes and names with european country list
    df = get_countries(df=df, countries_df=countries_df)


    df = df[['ISO3 Code', 'Year', value_name]]
    
    print(df.shape)
    
    return df


In [476]:
def combine_datasets(datasets_info, countries_df):
    """
    Iterates over all datasets in datasets_info, processes each one,
    and merges them into a single DataFrame on ['ISO3 Code', 'Year'].

    Parameters:
        datasets_info (dict): Dictionary containing dataset info with keys like:
                              {
                                  "rule_of_law": {
                                      "file_path": "rule_of_law",
                                      "value_name": "Rule_of_law"
                                  },
                                  ... 
                              }
        countries_df (pd.DataFrame): A DataFrame containing valid country information.

    Returns:
        pd.DataFrame: Final merged DataFrame with columns:
                      ['ISO3 Code', 'Year', <all value_names>]
    """
    merged_df = None

    for ds_name, info in datasets_info.items():
        file_path = info["file_path"]
        value_name = info["value_name"]
        
        try:
            # Process each dataset
            temp_df = process_dataset(
                file_path=file_path,
                value_name=value_name,
                countries_df=countries_df,  # use the argument, not the global default
            )
            print(f"Successfully processed: {ds_name}")

            # Ensure only necessary columns are present
            temp_df = temp_df[['ISO3 Code', 'Year', value_name]]
            
            # Merge into our final DF
            if merged_df is None:
                merged_df = temp_df
            else:
                merged_df = pd.merge(
                    merged_df,
                    temp_df,
                    on=['ISO3 Code', 'Year'],
                    how='outer'
                )

        except Exception as e:
            print(f"Error processing {ds_name}: {e}")
            continue

    if merged_df is not None:
        # Merge with the reference countries DataFrame to ensure all countries are included
        merged_df = pd.merge(countries_df[['ISO3 Code']], merged_df, on='ISO3 Code', how='left')
        merged_df.sort_values(by=['ISO3 Code', 'Year'], inplace=True)
        merged_df.reset_index(drop=True, inplace=True)

    return merged_df
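The two merge steps can be illustrated on hypothetical miniature frames: the outer merge keeps a country-year even if only one indicator covers it, and the final left merge against the reference list adds a row of missing values for any country without data.

```python
import pandas as pd

# Hypothetical miniature indicator frames.
rule = pd.DataFrame({"ISO3 Code": ["AUT", "NOR"], "Year": ["2012", "2012"],
                     "Rule_of_law": [1.8, 2.0]})
gini = pd.DataFrame({"ISO3 Code": ["AUT"], "Year": ["2012"], "Gini": [30.5]})

# Outer merge: Norway stays in the result with a missing Gini value
merged = pd.merge(rule, gini, on=["ISO3 Code", "Year"], how="outer")

# Left merge against the reference list: a country without any data
# ends up as a single row in which even Year is missing
reference = pd.DataFrame({"ISO3 Code": ["AUT", "NOR", "RKS"]})
full = pd.merge(reference, merged, on="ISO3 Code", how="left")
print(full)
```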


In [477]:
# Combine them all
final_df = combine_datasets(datasets_info, countries_df=countries)

# Inspect the merged DataFrame
print(final_df.head())
print(final_df.columns)

Missing countries: {'JEY', 'TJK', 'UZB', 'KGZ', 'AZE', 'KAZ', 'ARM', 'TKM', 'GRL'}
(564, 3)
Successfully processed: rule_of_law
Missing countries: {'JEY', 'TJK', 'UZB', 'KGZ', 'AZE', 'KAZ', 'ARM', 'TKM', 'GRL'}
(564, 3)
Successfully processed: government_effectiveness
Missing countries: {'JEY', 'TJK', 'UZB', 'KGZ', 'AZE', 'KAZ', 'ARM', 'TKM', 'GRL'}
(564, 3)
Successfully processed: control_of_corruption
Missing countries: {'FRO', 'TJK', 'UZB', 'KGZ', 'AZE', 'KAZ', 'GIB', 'ARM', 'CHI', 'TKM', 'IMN', 'GRL'}
(564, 3)
Successfully processed: fdi
Missing countries: {'AZE', 'GIB', '(Europe)'}
(564, 3)
Successfully processed: gdp_per_capita
Missing countries: {'GHA', 'GNQ', 'THA', 'GTM', 'ABW', 'CYM', 'BEN', 'FJI', 'FSM', 'VEN', 'WSM', 'MUS', 'LSO', 'BWA', 'COM', 'BTN', 'BHR', 'GRD', 'ARM', 'SAU', 'SYR', 'TGO', 'PAN', 'GUM', 'NZL', 'STP', 'MWI', 'ZWE', 'GUY', 'IMN', 'MOZ', 'IRN', 'PNG', 'MAC', 'AUS', 'LBN', 'SEN', 'MRT', 'TLS', 'KWT', 'GMB', 'SLE', 'CHN', 'COL', 'OMN', 'SYC', 'KGZ', 'NCL', 'P

In [481]:
final_df[final_df["ISO3 Code"]=="RKS"]

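|     | ISO3 Code | Year | Rule\_of\_law | Gov\_effectiveness | Control\_of\_corruption | Fdi | Gdp\_per\_capita | Gini | Unemployement | Political\_stability |
| --- | --------- | ---- | ------------- | ------------------ | ----------------------- | --- | ---------------- | ---- | ------------- | -------------------- |
| 444 | RKS       | NaN  | NaN           | NaN                | NaN                     | NaN | NaN              | NaN  | NaN           | NaN                  |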


### Missing values still need to be handled
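As a possible first step, the share of missing values per column and the countries without any indicator data could be inspected. The sketch below runs on a hypothetical three-row sample rather than on `final_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same layout as the merged data.
sample = pd.DataFrame({
    "ISO3 Code": ["AUT", "AUT", "RKS"],
    "Year": ["2012", "2013", np.nan],
    "Rule_of_law": [1.8, 1.8, np.nan],
    "Gini": [30.5, np.nan, np.nan],
})

# Share of missing values per column
missing_share = sample.drop(columns=["ISO3 Code"]).isna().mean()

# Countries for which every indicator is missing
indicator_cols = ["Rule_of_law", "Gini"]
no_data = sample[indicator_cols].isna().all(axis=1)
all_missing = sample.loc[no_data, "ISO3 Code"].unique().tolist()
print(missing_share)
print(all_missing)
```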

## World Bank Data Indicators

### 1. Basic Information

- **Data Source**: [https://databank.worldbank.org/indicator/NY.GDP.PCAP.CD/1ff4a498/Popular-Indicators?l=en#advancedDownloadOptions](https://databank.worldbank.org/indicator/NY.GDP.PCAP.CD/1ff4a498/Popular-Indicators?l=en#advancedDownloadOptions)
- **Description**: The dataset contains various economic indicators such as GDP per capita, foreign direct investment (FDI), unemployment rates, and governance indicators for European countries from 2012 to 2023.
- **Year/Coverage**: 2012-2023

### 2. Key Variables / Columns

| **Column Name**         | **Type** | **Description**                                                                          |
| ----------------------- | -------- | ---------------------------------------------------------------------------------------- |
| ISO3 Code               | object   | ISO 3 code of the country                                                                |
| Year                    | object   | Year of the recorded value                                                               |
| Rule\_of\_law           | float64  | Measures confidence in legal systems and contract enforcement (range: -2.5 to 2.5).      |
| Gov\_effectiveness      | float64  | Measures the quality of public services and policy implementation (range: -2.5 to 2.5).  |
| Control\_of\_corruption | float64  | Measures perceptions of corruption in public power (range: -2.5 to 2.5).                 |
| Fdi                     | float64  | Foreign direct investment, net inflows as a percentage of GDP.                           |
| Gdp\_per\_capita        | float64  | GDP per capita adjusted for purchasing power parity (PPP).                               |
| Gini                    | float64  | Gini index measuring income inequality (0 = perfect equality, 100 = perfect inequality). |
| Unemployement           | float64  | Total unemployment as a percentage of the total labor force.                             |
| Political\_stability    | float64  | Measures perceptions of the likelihood of political instability and politically motivated violence (range: -2.5 to 2.5). |

### 3. Data Cleaning / Transformation

- **Original Format**: Wide format with years as columns.
- **Filtering**: Retained only European countries.
- **Columns Kept/Removed**: Dropped unnecessary columns like `Series Code` and `Series Name`.
- **Data Type Conversions**: Converted `Year` to string and numeric columns to `float64`.
- **Handling Missing Values**: Missing values are left as NaN to indicate data gaps.
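The type conversions can be sketched as follows. The example assumes that missing values arrive as the placeholder string `..`, which is an assumption about the raw export; `errors="coerce"` turns any non-numeric entry into NaN.

```python
import pandas as pd

# Hypothetical raw values as they might look after reading the CSV.
raw = pd.DataFrame({
    "ISO3 Code": ["AUT", "AUT"],
    "Year": ["2012", "2013"],
    "Gini": ["30.5", ".."],  # ".." is assumed to mark a missing value
})

raw["Year"] = raw["Year"].astype(str)
# Non-numeric entries become NaN, and the column becomes float64
raw["Gini"] = pd.to_numeric(raw["Gini"], errors="coerce")
print(raw.dtypes)
```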

### 4. Data Context & Usage

#### GDP per Capita (PPP - Purchasing Power Parity)

- **Definition**: GDP at purchasing power parity (PPP) per capita is the PPP value of all final goods and services produced within an economy in a given year, divided by the mid-year population.
- **Range**: Measured in international dollars, adjusted for cost of living.
- **Note**: Helps compare living standards across countries.

#### Foreign Direct Investment (FDI), net inflows (% of GDP)

- **Definition**: Measures net inflows of investment to acquire a lasting management interest in an enterprise operating in another economy.
- **Range**: Expressed as a percentage of GDP.
- **Note**: Includes equity capital, reinvestment of earnings, and short/long-term capital.
- **Long definition**: Foreign direct investment are the net inflows of investment to acquire a lasting management interest (10 percent or more of voting stock) in an enterprise operating in an economy other than that of the investor. It is the sum of equity capital, reinvestment of earnings, other long-term capital, and short-term capital as shown in the balance of payments. This series shows net inflows (new investment inflows less disinvestment) in the reporting economy from foreign investors, and is divided by GDP.
Source: International Monetary Fund, International Financial Statistics and Balance of Payments databases, World Bank, International Debt Statistics, and World Bank and OECD GDP estimates.

#### Unemployment Rate (% of total labor force)

- **Definition**: Refers to the share of the labor force that is without work but available for and seeking employment.
- **Range**: Expressed as a percentage.
- **Note**: Definitions of labor force and unemployment differ by country.

#### Gini Index

- **Definition**: Measures the extent to which the distribution of income among individuals deviates from a perfectly equal distribution.
- **Range**: 0 represents perfect equality; 100 implies perfect inequality.
- **Note**: A higher Gini index indicates greater inequality.
- **Long definition**: Gini index measures the extent to which the distribution of income (or, in some cases, consumption expenditure) among individuals or households within an economy deviates from a perfectly equal distribution. A Lorenz curve plots the cumulative percentages of total income received against the cumulative number of recipients, starting with the poorest individual or household. The Gini index measures the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a percentage of the maximum area under the line.
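To make the definition more tangible, the Gini index can be computed for a small list of incomes. This is a minimal sketch using a standard rank-based formula for a discrete sample; the function name `gini_index` and the incomes are hypothetical, and this is not how the World Bank estimates its published values.

```python
def gini_index(incomes):
    """Gini index on a 0-100 scale for a list of non-negative incomes."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted formula for sorted values x_1 <= ... <= x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 100 * (2 * weighted / (n * total) - (n + 1) / n)


equal = gini_index([10, 10, 10, 10])   # everyone has the same income
unequal = gini_index([0, 0, 0, 40])    # one person holds everything
print(equal, unequal)
```

With only four people the most unequal case gives 75 rather than 100, because the maximum for a sample of size n is (n - 1) / n.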

#### Rule of Law

- **Definition**: Captures perceptions of confidence in the legal system, contract enforcement, and likelihood of crime and violence.
- **Range**: -2.5 to 2.5.
- **Note**: Higher values indicate stronger rule of law perceptions.
- **Long definition**: Rule of Law captures perceptions of the extent to which agents have confidence in and abide by the rules of society, and in particular the quality of contract enforcement, property rights, the police, and the courts, as well as the likelihood of crime and violence. Estimate gives the country's score on the aggregate indicator, in units of a standard normal distribution.

#### Control of Corruption

- **Definition**: Measures perceptions of how public power is exercised for private gain.
- **Range**: -2.5 to 2.5.
- **Note**: Higher values indicate better control of corruption.
- **Long definition**: Control of Corruption captures perceptions of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as "capture" of the state by elites and private interests. Estimate gives the country's score on the aggregate indicator, in units of a standard normal distribution.

#### Government Effectiveness

- **Definition**: Measures perceptions of the quality of public services and policy implementation.
- **Range**: -2.5 to 2.5.
- **Note**: Higher values indicate more effective governance.
- **Long definition**: Government Effectiveness captures perceptions of the quality of public services, the quality of the civil service and the degree of its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies. Estimate gives the country's score on the aggregate indicator, in units of a standard normal distribution, i.e. ranging from approximately -2.5 to 2.5.

#### Political Stability and Absence of Violence/Terrorism: Estimate

- **Definition**: Political Stability and Absence of Violence/Terrorism measures perceptions of the likelihood of political instability and/or politically-motivated violence, including terrorism.
- **Range**: -2.5 to 2.5, in units of a standard normal distribution.
- **Note**: Higher values indicate greater political stability.
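Because the governance estimates are expressed in units of a standard normal distribution, a score can be loosely translated into a percentile. The sketch below assumes the scores really are approximately standard normal across countries. The helper `normal_percentile` is hypothetical and is not the percentile rank that the World Bank publishes.

```python
import math


def normal_percentile(estimate):
    """Percentile of a score under a standard normal distribution (assumption)."""
    return 50 * (1 + math.erf(estimate / math.sqrt(2)))


# A score of 0 sits in the middle; a score of 1 is one standard
# deviation above the mean.
middle = normal_percentile(0.0)
one_sd = normal_percentile(1.0)
print(round(middle, 1), round(one_sd, 1))
```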