## How well can the level of corruption of a country in Europe be quantified? 

* What differences are there between actual corruption and perceived corruption?

* Are there different forms of corruption prevalent in different countries in Europe? 

* What characteristics of a country predict the level of corruption? 

* What characteristics of a country predict an increase or decrease in the level of corruption?

* Corruption Perceptions Index (CPI) from Transparency International: a data set that scores countries on perceived corruption and ranks them.

* World Bank Development Indicators (economic, social, and governance data): these indicators can be checked for correlation with each country's corruption score.

* European Social Survey (perception-related data).

* OECD data on governance and public sector integrity.
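The correlation check mentioned for the World Bank indicators could look roughly like the sketch below. It is only an illustration: the country codes, the values, and the `gdp_per_capita` column are made-up placeholders, not real data or actual World Bank column names. It assumes both tables can be joined on ISO3 code and year.

```python
import pandas as pd

# Toy CPI scores (hypothetical values, for illustration only).
cpi_toy = pd.DataFrame({
    "Economy ISO3": ["AAA", "BBB", "CCC", "DDD"],
    "Year": [2020, 2020, 2020, 2020],
    "Corruption Perceptions Index Score": [30.0, 45.0, 60.0, 85.0],
})

# Toy indicator table; "gdp_per_capita" is a placeholder column name.
indicators_toy = pd.DataFrame({
    "Economy ISO3": ["AAA", "BBB", "CCC", "DDD"],
    "Year": [2020, 2020, 2020, 2020],
    "gdp_per_capita": [5000.0, 12000.0, 30000.0, 55000.0],
})

# Join on country code and year so each score is paired with its indicator.
merged = cpi_toy.merge(indicators_toy, on=["Economy ISO3", "Year"], how="inner")

score = merged["Corruption Perceptions Index Score"]
indicator = merged["gdp_per_capita"]

# Pearson correlation measures the linear relationship.
pearson = score.corr(indicator)

# Correlating the ranks gives a rank-based (Spearman-style) measure,
# which is less sensitive to skewed indicators.
rank_corr = score.rank().corr(indicator.rank())

print(pearson, rank_corr)
```

A correlation found this way only shows that an indicator moves together with the CPI score; it does not show that the indicator causes corruption.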

In [125]:
import pandas as pd
import os

In [126]:
corruption_raw_data = pd.read_csv("../data/processed/CPI.csv")

In [127]:

corruption_raw_data.head()

Unnamed: 0,Economy ISO3,Economy Name,Year,Corruption Perceptions Index Rank,Corruption Perceptions Index Score,Corruption Perceptions Index Sources,Corruption Perceptions Index Standard Error
0,ALB,Albania,2012,113.0,33.0,7.0,2.0
1,ALB,Albania,2013,116.0,31.0,7.0,2.1
2,ALB,Albania,2014,110.0,33.0,7.0,1.51
3,ALB,Albania,2015,88.0,36.0,7.0,3.58
4,ALB,Albania,2016,83.0,39.0,7.0,1.99


In [128]:
countries = pd.read_csv("../data/processed/europe_countries.csv")

In [129]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    48 non-null     object
 1   ISO3 Code  48 non-null     object
 2   ISO2 Code  48 non-null     object
dtypes: object(3)
memory usage: 1.3+ KB


In [130]:
iso3_europe_all = set(countries["ISO3 Code"])

In [131]:
len(iso3_europe_all)

48

In [132]:
iso3_europe_cpi = set(corruption_raw_data["Economy ISO3"])

In [133]:
len(iso3_europe_cpi)

42

In [134]:
iso3_europe_all-iso3_europe_cpi

{'AND', 'LIE', 'MCO', 'RKS', 'SMR', 'VAT'}

Countries missing from the CPI data: Andorra, Liechtenstein, Monaco, Kosovo, San Marino, and Vatican City. I think we don't need these countries because of their small size and limited impact.
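To avoid translating ISO3 codes into names by hand, the missing codes can be looked up in the country table. The sketch below uses a small stand-in table with the same `Country` and `ISO3 Code` columns as `europe_countries.csv`; the three rows and the `cpi_codes_toy` set are illustrative only.

```python
import pandas as pd

# Stand-in for the Europe country table (same column names as the real file).
countries_toy = pd.DataFrame({
    "Country": ["Albania", "Andorra", "Monaco"],
    "ISO3 Code": ["ALB", "AND", "MCO"],
})

# Stand-in for the set of ISO3 codes present in the CPI data.
cpi_codes_toy = {"ALB"}

# Keep only the countries whose code does not appear in the CPI data.
missing = countries_toy[~countries_toy["ISO3 Code"].isin(cpi_codes_toy)]
missing_names = sorted(missing["Country"])

print(missing_names)
```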

In [135]:
corruption_raw_data = corruption_raw_data[corruption_raw_data["Economy ISO3"].isin(iso3_europe_all)]


In [136]:
corruption_raw_data.head()

Unnamed: 0,Economy ISO3,Economy Name,Year,Corruption Perceptions Index Rank,Corruption Perceptions Index Score,Corruption Perceptions Index Sources,Corruption Perceptions Index Standard Error
0,ALB,Albania,2012,113.0,33.0,7.0,2.0
1,ALB,Albania,2013,116.0,31.0,7.0,2.1
2,ALB,Albania,2014,110.0,33.0,7.0,1.51
3,ALB,Albania,2015,88.0,36.0,7.0,3.58
4,ALB,Albania,2016,83.0,39.0,7.0,1.99


In [137]:
corruption_raw_data["Economy ISO3"].nunique()

42

In [138]:
# Overwrites the input file with the Europe-only rows; the index is not written.
corruption_raw_data.to_csv("../data/processed/CPI.csv", index=False)

In [139]:
cpi = pd.read_csv("../data/processed/CPI.csv")

In [140]:
cpi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 7 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Economy ISO3                                 504 non-null    object 
 1   Economy Name                                 504 non-null    object 
 2   Year                                         504 non-null    int64  
 3   Corruption Perceptions Index Rank            491 non-null    float64
 4   Corruption Perceptions Index Score           504 non-null    float64
 5   Corruption Perceptions Index Sources         504 non-null    float64
 6   Corruption Perceptions Index Standard Error  504 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 27.7+ KB
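The `info()` output shows 491 non-null values in the rank column against 504 rows, so 13 rows have a score but no rank. A possible way to find the affected rows is sketched below on a toy frame with made-up codes and values; the same two lines could be applied to `cpi`.

```python
import pandas as pd

# Toy frame with one missing rank (hypothetical values).
cpi_toy = pd.DataFrame({
    "Economy ISO3": ["AAA", "AAA", "BBB", "BBB"],
    "Year": [2012, 2013, 2012, 2013],
    "Corruption Perceptions Index Rank": [10.0, None, 20.0, 21.0],
    "Corruption Perceptions Index Score": [70.0, 71.0, 50.0, 49.0],
})

# Count missing values per column.
missing_per_column = cpi_toy.isna().sum()

# Select the rows where the rank is missing to see which country-years are affected.
rows_without_rank = cpi_toy[cpi_toy["Corruption Perceptions Index Rank"].isna()]

print(missing_per_column)
print(rows_without_rank[["Economy ISO3", "Year"]])
```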
