## Our four starting questions, narrowed down to a single one


1. How well can the level of corruption in a European country be quantified?

This question is too broad and lacks specificity, which makes it a poor fit for a single notebook.
The meaning of "quantified" also needs clarification: are we looking at an index, a model, or a metric?

2. Are there different forms of corruption prevalent in different European countries?

While interesting, it may be challenging to obtain granular, comparable data across countries.
Requires deep qualitative insights, which may not be fully captured by available datasets.

3. What characteristics of a country predict the level of corruption?

Why it's promising:

- Allows for quantitative analysis using regression or classification models.
- Can leverage socio-economic, political, and governance indicators.
- Provides actionable insights for policymakers and organizations.
- Well-defined and measurable through publicly available datasets.

Potential challenges:

- Ensuring data quality and avoiding biases in reporting.
- Distinguishing correlation from causation.

4. What characteristics of a country predict an increase or decrease in the level of corruption?

Why it's promising:

- Focuses on change over time, enabling trend analysis.
- Useful for policy evaluation and forecasting.
- Encourages a deeper exploration of temporal datasets (e.g., economic reforms, governance improvements).
- Can help identify early warning signs for rising corruption.

Potential challenges:

- Requires time-series data and careful handling of lag effects.
- External factors (global economic crises, political events) may introduce noise.


The following question seems the most promising, so we decided to pursue only this one:

# What characteristics of a European country predict the level of corruption?

Why it's promising:

- Allows for quantitative analysis using regression or classification models.
- Can leverage socio-economic, political, and governance indicators.
- Provides actionable insights for policymakers and organizations.
- Well-defined and measurable through publicly available datasets.

Potential challenges:

- Ensuring data quality and avoiding biases in reporting.
- Distinguishing correlation from causation.

We may also tackle the following extended question, but will decide that later: Which country characteristics (e.g., economic, political, social indicators) best predict the level of corruption in European countries, and how do these characteristics relate to any gap between actual and perceived corruption?



# Start

As we have decided to focus on Europe, the first challenge is to create a dataset of all European countries. To keep this notebook from getting too long, we provide the finished dataset **europe_countries**. It lists every European country with its ISO3 and ISO2 codes, which should make preprocessing easier later on.

In [1]:
import numpy as np
import pandas as pd

In [2]:
countries = pd.read_csv("../data/processed/europe_countries.csv")

In [24]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    48 non-null     object
 1   ISO3 Code  48 non-null     object
 2   ISO2 Code  48 non-null     object
dtypes: object(3)
memory usage: 1.2+ KB


In [25]:
countries['Country'].unique()

array(['Albania', 'Andorra', 'Austria', 'Belarus', 'Belgium',
       'Bosnia & Herzegovina', 'Bulgaria', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France',
       'Georgia', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland',
       'Italy', 'Kosovo', 'Latvia', 'Liechtenstein', 'Lithuania',
       'Luxembourg', 'Malta', 'Moldova', 'Monaco', 'Montenegro',
       'Netherlands', 'North Macedonia', 'Norway', 'Poland', 'Portugal',
       'Romania', 'Russia', 'San Marino', 'Serbia', 'Slovakia',
       'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey', 'Ukraine',
       'United Kingdom', 'Vatican City'], dtype=object)

Counts of European countries vary between sources. Our list contains 48 countries, including the ones most relevant to our analysis. As we will see later, most of the other datasets do not cover all of them anyway, especially small states such as Vatican City or Monaco.
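Because the ISO codes serve as join keys for every dataset that follows, a quick consistency check on the country list can catch problems early. The snippet below is a minimal sketch under stated assumptions: `countries_demo` is a three-row stand-in for `europe_countries.csv`, and `check_iso_codes` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for europe_countries.csv (three rows only, for illustration)
countries_demo = pd.DataFrame({
    "Country": ["Albania", "Andorra", "Austria"],
    "ISO3 Code": ["ALB", "AND", "AUT"],
    "ISO2 Code": ["AL", "AD", "AT"],
})


def check_iso_codes(df):
    """Return a list of human-readable problems found in the ISO columns."""
    problems = []
    for col, length in (("ISO3 Code", 3), ("ISO2 Code", 2)):
        codes = df[col].astype(str).str.strip()
        # Each code should identify exactly one country
        if codes.duplicated().any():
            problems.append(f"{col}: duplicated codes")
        # ISO3 codes have three characters, ISO2 codes have two
        if (codes.str.len() != length).any():
            problems.append(f"{col}: codes that are not {length} characters long")
    return problems


problems = check_iso_codes(countries_demo)
print(problems or "ISO codes look consistent")
```

The same function could be called on the full `countries` DataFrame loaded above.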

### Corruption Perceptions Index (CPI) from Transparency International
A dataset that scores countries by perceived corruption and ranks them.

Link to data: https://images.transparencycdn.org/images/CPI2023_FullDataSet.zip

How the scores are calculated: https://www.transparency.org/en/news/how-cpi-scores-are-calculated

 0   Economy ISO3                                
 1   Economy Name                                 
 2   Year                                           
 3   Corruption Perceptions Index Rank            
 4   Corruption Perceptions Index Score           
 5   Corruption Perceptions Index Sources        
 6   Corruption Perceptions Index Standard Error

We took the xlsx file and transformed it into this more compact version, which contains only the features we need and only European countries. The columns are explained at the end of each processing section.

In [5]:
cpi_data = pd.read_csv("../data/processed/CPI.csv")

In [13]:
iso3_europe_full = set(countries["ISO3 Code"])

iso3_europe_cpi = set(cpi_data["Economy ISO3"])

print('Number of European countries in our list:', len(iso3_europe_full))
print('Number of countries in the CPI data:', len(iso3_europe_cpi))

print('Countries missing:')

iso3_europe_full - iso3_europe_cpi

Number of European countries in our list: 48
Number of countries in the CPI data: 42
Countries missing:


{'AND', 'LIE', 'MCO', 'RKS', 'SMR', 'VAT'}

Countries missing: Andorra, Liechtenstein, Monaco, Kosovo, San Marino, and Vatican City. We can neglect these countries because of their small size and limited impact.
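Reading the missing entries as country names rather than ISO codes makes this kind of coverage check less error-prone. The following is a small sketch of one way to do that lookup. The two DataFrames are toy stand-ins for `countries` and `cpi_data`, and the names `countries_demo`, `cpi_demo`, and `missing_names` are introduced only for illustration.

```python
import pandas as pd

# Toy stand-ins for the country list and the CPI data
countries_demo = pd.DataFrame({
    "Country": ["Albania", "Andorra", "Monaco", "Austria"],
    "ISO3 Code": ["ALB", "AND", "MCO", "AUT"],
})
cpi_demo = pd.DataFrame({"Economy ISO3": ["ALB", "AUT", "AUT"]})

# Codes in our list that never appear in the CPI data
missing_codes = set(countries_demo["ISO3 Code"]) - set(cpi_demo["Economy ISO3"])

# Translate the codes back into country names via the country list
missing_names = (
    countries_demo.loc[countries_demo["ISO3 Code"].isin(missing_codes), "Country"]
    .sort_values()
    .tolist()
)
print(missing_names)  # ['Andorra', 'Monaco']
```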

In [21]:
cpi_data['Year'].unique()

array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,
       2023])

In [23]:
cpi_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 7 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Economy ISO3                                 504 non-null    object 
 1   Economy Name                                 504 non-null    object 
 2   Year                                         504 non-null    int64  
 3   Corruption Perceptions Index Rank            491 non-null    float64
 4   Corruption Perceptions Index Score           504 non-null    float64
 5   Corruption Perceptions Index Sources         504 non-null    float64
 6   Corruption Perceptions Index Standard Error  504 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 27.7+ KB


#### CPI Dataset Summary
* Corruption Perceptions Index Rank (float64):
This represents the ranking of the country based on its Corruption Perceptions Index (CPI) score, with rank 1 being the least corrupt country.
Lower ranks indicate better perceived transparency.

* Corruption Perceptions Index Score (float64):
This score quantifies perceived corruption levels on a scale of 0 to 100, where:
0 = highly corrupt
100 = very clean (low corruption perception)
It is calculated using multiple data sources, standardized, and aggregated.

* Corruption Perceptions Index Sources (float64):
The number of sources used to calculate the CPI score for a given country.
The CPI uses multiple expert assessments and business surveys; having more sources increases reliability.

* Corruption Perceptions Index Standard Error (float64):
This represents the uncertainty or variability in the CPI score, showing how much variation exists among different data sources.
A lower standard error indicates a more reliable score, while a higher one suggests more disagreement or variability in corruption perceptions across sources.
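To make the role of the standard error more concrete, the sketch below turns a score and its standard error into an approximate 90% interval. It assumes a normal approximation (score ± 1.645 × standard error) and clips the result to the 0 to 100 scale. The rows use placeholder country codes and made-up values, not actual CPI data.

```python
import pandas as pd

Z_90 = 1.645  # two-sided 90% quantile of the standard normal distribution

# Placeholder rows with made-up values, using the column names of the CPI file
cpi_demo = pd.DataFrame({
    "Economy ISO3": ["AAA", "BBB"],
    "Corruption Perceptions Index Score": [90.0, 40.0],
    "Corruption Perceptions Index Standard Error": [2.0, 4.0],
})

score = cpi_demo["Corruption Perceptions Index Score"]
se = cpi_demo["Corruption Perceptions Index Standard Error"]

# Approximate interval, kept inside the valid 0-100 range of the index
cpi_demo["ci_lower"] = (score - Z_90 * se).clip(lower=0)
cpi_demo["ci_upper"] = (score + Z_90 * se).clip(upper=100)

print(cpi_demo[["Economy ISO3", "ci_lower", "ci_upper"]])
```

A wide interval is a hint that differences between two countries with similar scores should not be over-interpreted.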

### GDP Data

We decided to go for GDP per capita at purchasing power parity (PPP).

Why?

- Adjusted for cost of living, which makes for a fairer comparison across countries.
- Helps analyze the standard of living and economic well-being in relation to corruption.
- Frequently used in corruption-related studies to measure economic development.

Definition: A country's gross domestic product (GDP) at purchasing power parity (PPP) per capita is the PPP value of all final goods and services produced within an economy in a given year, divided by the average (or mid-year) population for the same year. This is similar to nominal GDP per capita but adjusted for the cost of living in each country.



## GDP Per Capita

### 1. Basic Information
- **Data Source**: https://databank.worldbank.org/indicator/NY.GDP.PCAP.CD/1ff4a498/Popular-Indicators?l=en#advancedDownloadOptions
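- **Description**: The dataset contains GDP per capita data (current US$) for European countries from 2012 to 2023. Note that the indicator in the link above, `NY.GDP.PCAP.CD`, is nominal GDP per capita; the PPP-adjusted World Bank series is `NY.GDP.PCAP.PP.CD`.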
- **Year/Coverage**: 2012-2023

### 2. Key Variables / Columns

| **Column Name** | **Type** | **Description**                      |
|-----------------|----------|--------------------------------------|
| Country Name    | object   | Name of the country                  |
| Country Code    | object   | ISO3 code of the country             |
| Year            | object   | Year of the recorded GDP value       |
| GDP_per_capita  | float64  | GDP per capita in current US dollars |

### 3. Data Cleaning / Transformation
- **Original Format**: Wide format with years as columns.
- **Filtering**: Retained only European countries.
- **Columns Kept/Removed**: Dropped `Series Code` and `Series Name` columns.
- **Data Type Conversions**: Converted `Year` to string and `GDP_per_capita` to float.
- **Handling Missing Values**: -

### 4. Data Context & Usage
- **Why This Dataset**: To analyze economic trends across European countries by measuring GDP per capita over time.
- **Potential Biases**: -
- **Reliability Considerations**: The data's reliability depends on the accuracy of the original source.

### 5. Additional Notes / References
- **Version / Date of Retrieval**: 20.01.25



In [88]:
gdp_ppp = pd.read_csv("../data/raw/gdp_per_capita.csv")

In [71]:

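# Drop the World Bank series metadata columns, which we do not need
gdp_ppp = gdp_ppp.drop(labels=["Series Code", "Series Name"], axis=1)

# Drop the last 8 rows of the export (assumed here to hold no country data)
gdp_ppp = gdp_ppp.drop(gdp_ppp.tail(8).index)

# Strip the bracketed suffix from the year column names, leaving only the year
gdp_ppp.columns = gdp_ppp.columns.str.replace(r"\[.*?\]", "", regex=True).str.strip()

gdp_ppp.head()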


Unnamed: 0,Country Name,Country Code,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Albania,ALB,4247.631343,4413.063383,4578.633208,3952.803574,4124.05539,4531.032207,5365.489347,5460.428237,5370.778623,6413.283286,6846.426143,8575.171134
1,Andorra,AND,41500.543579,42470.316116,44369.659691,38654.93472,40129.819201,40672.994335,42819.77458,41257.804585,37361.090067,42425.699676,42414.059009,46812.448449
2,Austria,AUT,48250.405914,50305.354577,51314.972262,43915.228021,45061.499392,47163.742578,51194.074984,49885.994736,48716.40989,53648.719074,52176.664914,56033.573792
3,Belarus,BLR,6953.215917,7998.080205,8341.290143,5967.06856,5039.775609,5785.533977,6360.053101,6837.768321,6542.86454,7489.718947,7994.648061,7829.053137
4,Belgium,BEL,44874.170918,46964.594678,47995.778696,40893.804538,41854.54983,44035.323936,47487.210039,46716.622747,45906.287581,51655.78833,50807.204708,54700.909324


In [76]:
# Convert wide format to long format
gdp_ppp_long = gdp_ppp.melt(
    id_vars=['Country Name', 'Country Code'],
    var_name='Year',
    value_name='GDP_per_capita'
)

In [83]:
#gdp_ppp_long.sort_values(["Country Name", "Year"]).head()

In [87]:
# index=False avoids writing the row index as an extra unnamed column
gdp_ppp_long.to_csv("../data/processed/gdp_per_capita.csv", index=False)

In [90]:
gdp_ppp_long.head()

Unnamed: 0,Country Name,Country Code,Year,GDP_per_capita
0,Albania,ALB,2012,4247.631343
1,Andorra,AND,2012,41500.543579
2,Austria,AUT,2012,48250.405914
3,Belarus,BLR,2012,6953.215917
4,Belgium,BEL,2012,44874.170918
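After melting, `Year` holds strings (the former column names), and the value column may still contain non-numeric placeholders. The sketch below shows one way to normalise both. It assumes missing observations are marked with `..`, as is common in World Bank DataBank exports. The rows use placeholder codes and made-up values, and `gdp_long_demo` is a name introduced only for this illustration.

```python
import pandas as pd

# Toy long-format frame; ".." stands for a missing observation (assumption)
gdp_long_demo = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Year": ["2012", "2012", "2012"],
    "GDP_per_capita": ["4000.5", "..", "48000.25"],
})

# Years as integers make sorting and joining with other datasets easier
gdp_long_demo["Year"] = gdp_long_demo["Year"].astype(int)

# Anything that cannot be parsed as a number becomes NaN instead of raising
gdp_long_demo["GDP_per_capita"] = pd.to_numeric(
    gdp_long_demo["GDP_per_capita"], errors="coerce"
)

n_missing = int(gdp_long_demo["GDP_per_capita"].isna().sum())
print(gdp_long_demo.dtypes)
print("Missing GDP values:", n_missing)
```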


### FDI Data

Helper function that filters a dataset down to the countries in our European country list.

In [153]:
def get_countries(df, countries_df, iso="ISO3 Code", iso_data_name="Economy ISO3"):
    """Keep only the rows of df whose ISO code appears in countries_df."""
    try:
        # Ensure the required columns exist in both DataFrames
        if iso_data_name not in df.columns:
            raise KeyError(f"Column '{iso_data_name}' not found in DataFrame")

        if iso not in countries_df.columns:
            raise KeyError(f"Column '{iso}' not found in countries DataFrame")

        # Normalise the ISO codes: cast to string and strip whitespace
        iso_europe = set(countries_df[iso].dropna().astype(str).str.strip())
        data_codes = df[iso_data_name].astype(str).str.strip()
        iso_data = set(data_codes[df[iso_data_name].notna()])

        # Codes that occur in the data but not in our European list;
        # the rows with these codes are dropped by the filter below
        missing_countries = iso_data - iso_europe
        if missing_countries:
            print("Missing countries:", missing_countries)

        # Filter on the cleaned codes so stray whitespace cannot hide a match
        filtered_df = df[data_codes.isin(iso_europe)]

        return filtered_df

    except KeyError as err:
        print(f"KeyError: {err}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


In [139]:
fdi = pd.read_csv("../data/raw/fdi.csv")

In [164]:
#fdi.columns = fdi.columns.str.strip().str.replace("\n", "", regex=True)


In [159]:
fdi = get_countries(df=fdi, countries_df=countries, iso_data_name="Country Code")

Missing countries: {'FRO', 'XKX', 'TJK', 'UZB', 'nan', 'KGZ', 'AZE', 'KAZ', 'GIB', 'ARM', 'CHI', 'TKM', 'IMN', 'GRL'}
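One entry in this output deserves attention: `XKX` is the code the World Bank uses for Kosovo, whereas our country list contains Kosovo under `RKS` (as seen in the CPI comparison above). The filter therefore drops Kosovo even though it is in our list. A possible remedy, sketched here with toy data, is to translate source-specific codes before filtering. `CODE_ALIASES`, `fdi_demo`, and `europe_codes` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical mapping from codes used by a data source to codes in our list
CODE_ALIASES = {"XKX": "RKS"}

# Toy stand-in for a World Bank export and for our European ISO3 codes
fdi_demo = pd.DataFrame({
    "Country Name": ["Albania", "Kosovo", "Kazakhstan"],
    "Country Code": ["ALB", "XKX", "KAZ"],
})
europe_codes = {"ALB", "RKS"}

# Harmonise the codes first, then filter on the European list
fdi_demo["Country Code"] = fdi_demo["Country Code"].replace(CODE_ALIASES)
fdi_europe = fdi_demo[fdi_demo["Country Code"].isin(europe_codes)]

print(fdi_europe["Country Code"].tolist())  # ['ALB', 'RKS']
```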


In [155]:
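def convert_worldbank_df(df, value_name):
    """Convert a wide World Bank export (one column per year) to long format."""
    # Drop the series metadata columns, which we do not need
    df = df.drop(labels=["Series Code", "Series Name"], axis=1)

    #df = df.drop(df.tail(8).index)

    # Strip the bracketed suffix from the year column names
    df.columns = df.columns.str.replace(r"\[.*?\]", "", regex=True).str.strip()

    # One row per country and year
    df = df.melt(
        id_vars=['Country Name', 'Country Code'],
        var_name='Year',
        value_name=value_name
    )

    return df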
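To illustrate what the column clean-up and the melt step do, here is a small self-contained sketch on a toy frame. It assumes the year columns of the export are named like `2012 [YR2012]`; the values are made up, and `wide_demo` and `long_demo` are names introduced only for this example.

```python
import pandas as pd

# Toy wide-format frame with year headers in the assumed export style
wide_demo = pd.DataFrame({
    "Country Name": ["Albania"],
    "Country Code": ["ALB"],
    "2012 [YR2012]": [1.5],
    "2013 [YR2013]": [2.5],
})

# Remove the bracketed part and surrounding whitespace from every column name
wide_demo.columns = (
    wide_demo.columns.str.replace(r"\[.*?\]", "", regex=True).str.strip()
)
print(list(wide_demo.columns))  # ['Country Name', 'Country Code', '2012', '2013']

# Reshape to one row per country and year
long_demo = wide_demo.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="FDI",
)
print(long_demo)
```

Note that `Year` ends up as a string column, because it is built from the former column names.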

In [161]:
fdi = convert_worldbank_df(df=fdi, value_name="FDI")

In [162]:
fdi.head()

Unnamed: 0,Country Name,Country Code,Year,FDI
0,Albania,ALB,2012,7.451355
1,Andorra,AND,2012,
2,Austria,AUT,2012,1.283003
3,Belarus,BLR,2012,2.22818
4,Belgium,BEL,2012,2.369569


In [163]:
fdi["Country Code"].nunique()

45

### Unemployment Rate

In [174]:
unemployement = pd.read_csv('../data/raw/unemployement.csv')

In [177]:
unemployement = get_countries(df=unemployement, countries_df=countries, iso_data_name="Country Code")

Missing countries: {'GHA', 'GNQ', 'THA', 'GTM', 'ABW', 'CYM', 'BEN', 'FJI', 'FSM', 'VEN', 'WSM', 'MUS', 'LSO', 'BWA', 'COM', 'BTN', 'BHR', 'GRD', 'ARM', 'SAU', 'SYR', 'TGO', 'PAN', 'XKX', 'GUM', 'NZL', 'STP', 'MWI', 'ZWE', 'GUY', 'IMN', 'MOZ', 'IRN', 'PNG', 'MAC', 'AUS', 'LBN', 'SEN', 'MRT', 'TLS', 'KWT', 'GMB', 'SLE', 'CHN', 'COL', 'OMN', 'SYC', 'KGZ', 'NCL', 'PRK', 'NGA', 'CUB', 'TJK', 'URY', 'CIV', 'MAR', 'ISR', 'YEM', 'PYF', 'USA', 'BHS', 'PRI', 'CPV', 'LAO', 'BRA', 'ATG', 'SWZ', 'TKM', 'JAM', 'KHM', 'TCA', 'BMU', 'PAK', 'LBY', 'DZA', 'CUW', 'BLZ', 'AGO', 'HND', 'GRL', 'SDN', 'MAF', 'MNG', 'AZE', 'SGP', 'SLV', 'COG', 'HKG', 'PHL', 'PSE', 'LKA', 'KAZ', 'KNA', 'TTO', 'VIR', 'ZAF', 'HTI', 'ARG', 'CMR', 'GIB', 'TCD', 'RWA', 'SLB', 'COD', 'CAN', 'NRU', 'TUV', 'EGY', 'KIR', 'BRB', 'UGA', 'TUN', 'CHI', 'MLI', 'NPL', 'LBR', 'VCT', 'NAM', 'MDV', 'NIC', 'IND', 'TZA', 'BRN', 'BGD', 'ETH', 'CHL', 'DOM', 'MEX', 'JOR', 'MNP', 'ERI', 'PLW', 'LCA', 'SUR', 'QAT', 'NER', 'KEN', 'ECU', 'BOL', 'ARE', 

In [170]:
unemployement["Country Code"].nunique()

46

In [175]:
unemployement = convert_worldbank_df(df=unemployement, value_name="Unemployement_rate")

In [178]:
unemployement.head()

Unnamed: 0,Country Name,Country Code,Year,Unemployement_rate
1,Albania,ALB,2012,13.376
4,Andorra,AND,2012,
11,Austria,AUT,2012,4.909
17,Belarus,BLR,2012,0.5
18,Belgium,BEL,2012,7.542
