## How well can the level of corruption of a country in Europe be quantified? 

* How do actual corruption and perceived corruption differ?

* Are there different forms of corruption prevalent in different countries in Europe? 

* What characteristics of a country predict the level of corruption? 

* What characteristics of a country predict an increase or decrease in the level of corruption?

* Corruption Perceptions Index (CPI) from Transparency International: a data set that scores countries on perceived corruption and ranks them.

* World Bank Development Indicators (economic, social, and governance data): indicators that could be checked for correlation with each country's corruption score.

* European Social Survey (perception-related data).

* OECD data on governance and public sector integrity.
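One way to look for a relationship between a World Bank indicator and the CPI score is a rank correlation across countries. The sketch below is only an illustration: the `iso3` and `value` column names, the `rank_correlation` helper, and the toy numbers are all assumptions, not part of the data sets listed above.

```python
import pandas as pd


def rank_correlation(cpi: pd.DataFrame, indicator: pd.DataFrame,
                     key: str = "iso3") -> float:
    """Spearman correlation between CPI score and one indicator.

    Both frames are assumed (hypothetically) to have one row per country,
    a shared `key` column, and a `value` column.
    """
    merged = cpi.merge(indicator, on=key, suffixes=("_cpi", "_ind"))
    # Drop countries missing either value before ranking.
    merged = merged.dropna(subset=["value_cpi", "value_ind"])
    # Pearson correlation of the ranks equals Spearman's rho.
    return merged["value_cpi"].rank().corr(merged["value_ind"].rank())


# Made-up numbers, for illustration only (not real CPI or World Bank values).
toy_cpi = pd.DataFrame({"iso3": ["AAA", "BBB", "CCC", "DDD"],
                        "value": [30.0, 45.0, 60.0, 90.0]})
toy_ind = pd.DataFrame({"iso3": ["AAA", "BBB", "CCC", "DDD"],
                        "value": [1.0, 2.5, 2.7, 8.0]})
rho = rank_correlation(toy_cpi, toy_ind)
```

Ranking first makes the measure robust to skewed indicators such as GDP per capita; a correlation found this way says nothing about causation.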

In [71]:
import pandas as pd
import os

In [72]:
ess_path = "../data/raw/ESS.csv"

In [73]:
ess_data = pd.read_csv(ess_path)


In [74]:
ess_data.tail()

Unnamed: 0,name,essround,edition,proddate,idno,cntry,dweight,pspwght,pweight,anweight,...,trstplc,trstplt,trstprt,trstun,trstsci,loylead,medcrgv,medcrgvc,meprinf,implvdm
530706,ESS11e02,11,2.0,20.11.2024,86379,SK,1.758655,3.777879,0.315904,1.193448,...,10,3,3.0,10,,5.0,,,,
530707,ESS11e02,11,2.0,20.11.2024,86407,SK,1.302318,0.698084,0.315904,0.220528,...,10,4,6.0,6,,2.0,,,,
530708,ESS11e02,11,2.0,20.11.2024,86408,SK,1.262293,3.717124,0.315904,1.174255,...,5,5,5.0,8,,3.0,,,,
530709,ESS11e02,11,2.0,20.11.2024,86426,SK,4.000705,4.000297,0.315904,1.263711,...,2,3,88.0,88,,2.0,,,,
530710,ESS11e02,11,2.0,20.11.2024,86453,SK,1.111078,1.031229,0.315904,0.32577,...,10,7,8.0,7,,5.0,,,,
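The value 88 in `trstprt` and `trstun` above falls outside the 0 to 10 answer scale and is probably a missing-value code rather than a real answer. The sketch below assumes that 77, 88, and 99 are the missing codes for these scales; that assumption should be checked against the ESS codebook before use. The `recode_missing` helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed ESS missing-value codes for 0-10 scales; verify against the codebook.
ESS_MISSING_CODES = [77, 88, 99]


def recode_missing(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `df` with the missing-value codes in `columns` set to NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(ESS_MISSING_CODES, np.nan)
    return out


# Tiny illustrative frame using two trust columns from the output above.
sample = pd.DataFrame({"trstprt": [3.0, 6.0, 88.0], "trstun": [10, 6, 88]})
cleaned = recode_missing(sample, ["trstprt", "trstun"])
```

Without this step, a country mean of a trust item would be inflated by every 88 left in the column.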


In [12]:
# pd.read_html returns a list of DataFrames, one per <table> in the file
html_data = pd.read_html("../data/raw/ESS.html")

In [75]:
ess_data["cntry"].unique()

array(['AT', 'BE', 'CH', 'CZ', 'DE', 'DK', 'ES', 'FI', 'FR', 'GB', 'GR',
       'HU', 'IE', 'IL', 'IT', 'LU', 'NL', 'NO', 'PL', 'PT', 'SE', 'SI',
       'EE', 'IS', 'SK', 'TR', 'UA', 'BG', 'CY', 'RU', 'HR', 'LV', 'RO',
       'LT', 'AL', 'XK', 'ME', 'RS', 'MK'], dtype=object)

In [76]:
corruption_raw_data = pd.read_csv("../data/raw/TI-CPI.csv")

In [81]:

corruption_raw_data.head()

Unnamed: 0,Economy ISO3,Economy Name,Indicator ID,Indicator,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,AFG,Afghanistan,TI.CPI.Rank,Corruption Perceptions Index Rank,174.0,175.0,172.0,166.0,169.0,177.0,172.0,173.0,165.0,174.0,150.0,162.0
1,AFG,Afghanistan,TI.CPI.STDERR,Corruption Perceptions Index Standard Error,3.3,3.3,1.29,3.49,1.74,1.39,1.41,2.55,2.44,2.12,6.3,6.24
2,AFG,Afghanistan,TI.CPI.Score,Corruption Perceptions Index Score,8.0,8.0,12.0,11.0,15.0,15.0,16.0,16.0,19.0,16.0,24.0,20.0
3,AFG,Afghanistan,TI.CPI.Sources,Corruption Perceptions Index Sources,3.0,3.0,4.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0
4,AGO,Angola,TI.CPI.Rank,Corruption Perceptions Index Rank,157.0,153.0,161.0,163.0,164.0,167.0,165.0,146.0,142.0,136.0,116.0,121.0


In [80]:
# errors="ignore" keeps this cell safe to re-run once the columns are gone
corruption_raw_data = corruption_raw_data.drop(columns=["Attribute 1", "Attribute 2", "Attribute 3", "Partner"], errors="ignore")

In [82]:
corruption_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 726 entries, 0 to 725
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Economy ISO3  726 non-null    object 
 1   Economy Name  726 non-null    object 
 2   Indicator ID  726 non-null    object 
 3   Indicator     726 non-null    object 
 4   2012          692 non-null    float64
 5   2013          696 non-null    float64
 6   2014          687 non-null    float64
 7   2015          661 non-null    float64
 8   2016          693 non-null    float64
 9   2017          720 non-null    float64
 10  2018          720 non-null    float64
 11  2019          720 non-null    float64
 12  2020          720 non-null    float64
 13  2021          720 non-null    float64
 14  2022          720 non-null    float64
 15  2023          720 non-null    float64
dtypes: float64(12), object(4)
memory usage: 90.9+ KB
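The CPI table is wide: four indicator rows per country and one column per year. For comparisons over time it may be easier to work with one row per country and year. The following is a minimal sketch of that reshape, assuming every column other than the four identifier columns is a year; the `cpi_scores_long` helper and the `year` and `score` column names are illustrative choices.

```python
import pandas as pd


def cpi_scores_long(cpi_wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape the wide CPI table to one row per (country, year) score."""
    id_cols = ["Economy ISO3", "Economy Name", "Indicator ID", "Indicator"]
    # Keep only the score rows; rank, standard error and sources are dropped.
    scores = cpi_wide[cpi_wide["Indicator ID"] == "TI.CPI.Score"]
    long = scores.melt(id_vars=id_cols, var_name="year", value_name="score")
    long["year"] = long["year"].astype(int)
    return long.drop(columns=["Indicator ID", "Indicator"]).dropna(subset=["score"])


# Two rows copied from the head() output above, truncated to two years.
demo = pd.DataFrame({
    "Economy ISO3": ["AFG", "AFG"],
    "Economy Name": ["Afghanistan", "Afghanistan"],
    "Indicator ID": ["TI.CPI.Rank", "TI.CPI.Score"],
    "Indicator": ["Corruption Perceptions Index Rank",
                  "Corruption Perceptions Index Score"],
    "2012": [174.0, 8.0],
    "2013": [175.0, 8.0],
})
long_scores = cpi_scores_long(demo)
```

Dropping rows with a missing score reflects the `info()` output, where the earlier years have fewer non-null values than the later ones.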


In [54]:
countries = pd.read_csv("../data/processed/europe_countries.csv")

In [59]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    48 non-null     object
 1   ISO3 Code  48 non-null     object
 2   ISO2 Code  48 non-null     object
dtypes: object(3)
memory usage: 1.2+ KB
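The ESS data identifies countries by two-letter codes in `cntry`, while the CPI table uses three-letter codes in `Economy ISO3`. Since the countries table carries both, it could serve as a lookup when the two sources are joined. A sketch under that assumption follows; the `add_iso3` helper and the `iso3` column name are hypothetical.

```python
import pandas as pd


def add_iso3(ess: pd.DataFrame, countries: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `ess` with an `iso3` column looked up from `cntry`."""
    iso2_to_iso3 = dict(zip(countries["ISO2 Code"], countries["ISO3 Code"]))
    out = ess.copy()
    # Codes absent from the lookup table become NaN rather than raising.
    out["iso3"] = out["cntry"].map(iso2_to_iso3)
    return out


# Small stand-in frames; the real tables are loaded from the CSV files.
demo_lookup = pd.DataFrame({"Country": ["Slovakia", "Austria"],
                            "ISO3 Code": ["SVK", "AUT"],
                            "ISO2 Code": ["SK", "AT"]})
demo_ess = pd.DataFrame({"cntry": ["SK", "AT", "IL"]})
mapped = add_iso3(demo_ess, demo_lookup)
```

Rows left with a missing `iso3` are worth inspecting, since they point to ESS country codes that the lookup table does not cover.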


In [83]:
iso3_europe = set(countries["ISO3 Code"])

In [84]:
corruption_raw_data = corruption_raw_data[corruption_raw_data["Economy ISO3"].isin(iso3_europe)]

In [91]:
corruption_raw_data["Economy ISO3"].nunique()

42
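The countries table has 48 entries but only 42 codes matched, so six European countries have no CPI rows under the code listed for them. A small sketch for listing them is shown here; the `missing_countries` helper is hypothetical, and note that `corruption_raw_data` has already been filtered at this point, which does not affect the result.

```python
import pandas as pd


def missing_countries(countries: pd.DataFrame, cpi: pd.DataFrame) -> pd.DataFrame:
    """Rows of `countries` whose ISO3 code never appears in the CPI table."""
    present = set(cpi["Economy ISO3"])
    return countries[~countries["ISO3 Code"].isin(present)]


# Illustrative stand-ins; "XXX" is a placeholder code, not a real country.
demo_countries = pd.DataFrame({"Country": ["Slovakia", "Placeholder"],
                               "ISO3 Code": ["SVK", "XXX"],
                               "ISO2 Code": ["SK", "XX"]})
demo_cpi = pd.DataFrame({"Economy ISO3": ["SVK", "SVK"]})
absent = missing_countries(demo_countries, demo_cpi)
```

In the notebook this would be called as `missing_countries(countries, corruption_raw_data)`. A mismatch can mean the country is not covered by the CPI or that the two files use different codes for it.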

In [92]:
corruption_raw_data.to_csv("../data/processed/CPI.csv", index=False)