## How well can the level of corruption of a country in Europe be quantified? 

* What differences are there in actual corruption and perceived corruption? 

* Are there different forms of corruption prevalent in different countries in Europe? 

* What characteristics of a country predict the level of corruption? 

* What characteristics of a country predict an increase or decrease in the level of corruption?

Possible data sources:

* Corruption Perceptions Index (CPI) from Transparency International: a data set that scores countries on perceived corruption and ranks them.

* World Bank Development Indicators (economic, social, and governance data): these indicators could be used to look for correlations with the countries' corruption scores.

* European Social Survey (perception-related data).

* OECD data on governance and public sector integrity.
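
The correlation check mentioned for the World Bank indicators could look roughly like the sketch below: join the CPI scores with an indicator table on country code and year, then compute a Pearson correlation. The two frames, the `gdp_per_capita` column name, and all values are made up for illustration and are not taken from either data source.

```python
import pandas as pd

# Toy CPI scores (illustrative values only, not real data)
cpi_toy = pd.DataFrame({
    "Economy ISO3": ["AAA", "BBB", "CCC", "DDD"],
    "Year": [2020, 2020, 2020, 2020],
    "Score": [30.0, 45.0, 60.0, 80.0],
})

# Toy indicator table; `gdp_per_capita` is a hypothetical column name
indicators_toy = pd.DataFrame({
    "Economy ISO3": ["AAA", "BBB", "CCC", "DDD"],
    "Year": [2020, 2020, 2020, 2020],
    "gdp_per_capita": [5000.0, 12000.0, 30000.0, 50000.0],
})

# Inner join keeps only country/year pairs present in both tables
merged = cpi_toy.merge(indicators_toy, on=["Economy ISO3", "Year"], how="inner")

# Series.corr defaults to the Pearson correlation coefficient
pearson_r = merged["Score"].corr(merged["gdp_per_capita"])
print(round(pearson_r, 3))
```

A correlation like this only describes a linear association across countries; it says nothing about which way any effect runs.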

In [71]:
import pandas as pd
import os

In [72]:
corruption_raw_data = pd.read_csv("../data/processed/CPI.csv")

In [73]:

corruption_raw_data.head()


Unnamed: 0,Economy ISO3,Economy Name,Indicator ID,Indicator,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,ALB,Albania,TI.CPI.Rank,Corruption Perceptions Index Rank,113.0,116.0,110.0,88.0,83.0,91.0,99.0,106.0,104.0,110.0,101.0,98.0
1,ALB,Albania,TI.CPI.STDERR,Corruption Perceptions Index Standard Error,2.0,2.1,1.51,3.58,1.99,1.81,1.65,2.51,0.92,1.33,1.32,1.56
2,ALB,Albania,TI.CPI.Score,Corruption Perceptions Index Score,33.0,31.0,33.0,36.0,39.0,38.0,36.0,35.0,36.0,35.0,36.0,37.0
3,ALB,Albania,TI.CPI.Sources,Corruption Perceptions Index Sources,7.0,7.0,7.0,7.0,7.0,8.0,8.0,8.0,8.0,8.0,8.0,7.0
4,AUT,Austria,TI.CPI.Rank,Corruption Perceptions Index Rank,25.0,26.0,23.0,16.0,17.0,16.0,14.0,12.0,15.0,13.0,22.0,20.0


In [74]:
countries = pd.read_csv("../data/processed/europe_countries.csv")

In [75]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    48 non-null     object
 1   ISO3 Code  48 non-null     object
 2   ISO2 Code  48 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


In [76]:
iso3_europe_all = set(countries["ISO3 Code"])

In [77]:
len(iso3_europe_all)

48

In [78]:
iso3_europe_cpi = set(corruption_raw_data["Economy ISO3"])

In [79]:
len(iso3_europe_cpi)

42

In [80]:
iso3_europe_all-iso3_europe_cpi

{'AND', 'LIE', 'MCO', 'RKS', 'SMR', 'VAT'}

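Countries missing from the CPI data: Andorra, Liechtenstein, Monaco, Kosovo, San Marino, and Vatican City. We can probably leave these out because of their small size and limited impact. Kosovo has no official ISO 3166 code, so it may be worth checking whether it appears in the CPI data under another code such as `XKX`.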
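
To avoid translating ISO codes by hand, the missing codes can be looked up in the country table. This is a minimal sketch on toy stand-ins for `countries` and the CPI code set; only the column names `Country` and `ISO3 Code` are taken from the `countries.info()` output above.

```python
import pandas as pd

# Toy stand-in for `countries`, using the same column names
countries_toy = pd.DataFrame({
    "Country": ["Albania", "Andorra", "Austria", "Monaco"],
    "ISO3 Code": ["ALB", "AND", "AUT", "MCO"],
})

# Toy stand-in for the set of codes found in the CPI data
cpi_codes_toy = {"ALB", "AUT"}

# Codes in the country list that have no CPI rows
missing_codes = set(countries_toy["ISO3 Code"]) - cpi_codes_toy

# Look up the readable names for those codes
missing_names = (
    countries_toy.loc[countries_toy["ISO3 Code"].isin(missing_codes), "Country"]
    .sort_values()
    .tolist()
)
print(missing_names)
```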

In [81]:
corruption_raw_data = corruption_raw_data[corruption_raw_data["Economy ISO3"].isin(iso3_europe_all)]

corruption_raw_data_melted = corruption_raw_data.melt(
    id_vars=["Economy ISO3", "Economy Name", "Indicator"],  # Columns to keep as identifiers
    value_vars=[str(year) for year in range(2012, 2024)],  # Columns representing years
    var_name="Year",  # New column for years
    value_name="Value"  # New column for values
)

# Pivot the melted DataFrame to make each indicator a column
corruption_transposed = corruption_raw_data_melted.pivot(
    index=["Economy ISO3", "Economy Name", "Year"],  # Columns to keep as index
    columns="Indicator",  # Values to pivot into columns
    values="Value"  # Values to use in the new columns
).reset_index()

# Drop the leftover "Indicator" label that pivot leaves on the column index
corruption_transposed.columns.name = None
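
The melt-then-pivot reshape above is easier to follow on a tiny frame. The sketch below uses a made-up economy `AAA` with two indicators and two years; the shortened indicator names `Score` and `Rank` are placeholders, not the real labels.

```python
import pandas as pd

# Wide input: one row per (economy, indicator), one column per year
wide = pd.DataFrame({
    "Economy ISO3": ["AAA", "AAA"],
    "Indicator": ["Score", "Rank"],
    "2012": [33.0, 113.0],
    "2013": [31.0, 116.0],
})

# Step 1: melt the year columns into rows (one row per economy/indicator/year)
long_df = wide.melt(
    id_vars=["Economy ISO3", "Indicator"],
    var_name="Year",
    value_name="Value",
)

# Step 2: pivot the indicators back out into columns (one row per economy/year)
tidy = long_df.pivot(
    index=["Economy ISO3", "Year"],
    columns="Indicator",
    values="Value",
).reset_index()
tidy.columns.name = None

print(tidy)
```

Note that `Year` holds strings after the melt, because it comes from column labels; it only becomes an integer column once the CSV is written and read back.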

In [82]:
corruption_transposed.head()

Unnamed: 0,Economy ISO3,Economy Name,Year,Corruption Perceptions Index Rank,Corruption Perceptions Index Score,Corruption Perceptions Index Sources,Corruption Perceptions Index Standard Error
0,ALB,Albania,2012,113.0,33.0,7.0,2.0
1,ALB,Albania,2013,116.0,31.0,7.0,2.1
2,ALB,Albania,2014,110.0,33.0,7.0,1.51
3,ALB,Albania,2015,88.0,36.0,7.0,3.58
4,ALB,Albania,2016,83.0,39.0,7.0,1.99


In [83]:
corruption_transposed["Economy ISO3"].nunique()

42

In [84]:
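# Note: this overwrites the wide-format file read at the top of the notebook,
# so re-running from the start will fail at the melt step.
corruption_transposed.to_csv("../data/processed/CPI.csv", index=False)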

In [85]:
cpi = pd.read_csv("../data/processed/CPI.csv")

In [86]:
cpi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 7 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Economy ISO3                                 504 non-null    object 
 1   Economy Name                                 504 non-null    object 
 2   Year                                         504 non-null    int64  
 3   Corruption Perceptions Index Rank            491 non-null    float64
 4   Corruption Perceptions Index Score           504 non-null    float64
 5   Corruption Perceptions Index Sources         504 non-null    float64
 6   Corruption Perceptions Index Standard Error  504 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 27.7+ KB
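
The `info()` output above shows 491 non-null ranks in 504 rows, so 13 country-year rows have a score but no rank. One way to find them is sketched below on a toy frame; the economies and values are invented, and only the column names match the real table.

```python
import pandas as pd

# Toy stand-in for `cpi` with one missing rank (values are illustrative)
cpi_toy = pd.DataFrame({
    "Economy ISO3": ["AAA", "AAA", "BBB", "BBB"],
    "Year": [2012, 2013, 2012, 2013],
    "Corruption Perceptions Index Rank": [10.0, None, 25.0, 24.0],
    "Corruption Perceptions Index Score": [80.0, 79.0, 60.0, 61.0],
})

rank_col = "Corruption Perceptions Index Rank"

# Rows where the rank is NaN
missing_rank = cpi_toy[cpi_toy[rank_col].isna()]

# Years with a missing rank, grouped by economy
missing_per_country = missing_rank.groupby("Economy ISO3")["Year"].apply(list)
print(missing_per_country)
```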
