<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Full example - Part_3

## The DEMOCRACY INDEX data:

In [3]:
import pandas as pd
link2= "https://en.wikipedia.org/wiki/Democracy_Index" 

# read_html returns a list with one data frame per table that matches 'attrs':
demos=pd.read_html(link2,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})

Let's see:

In [7]:
demos[4]

Unnamed: 0,Rank,Δ Rank,Country,Regime type,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties,Overall score,Δ Score
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,1,,Norway,Full democracy,10.00,9.64,10.00,10.00,9.41,9.81,0.06
2,2,,Iceland,Full democracy,10.00,8.57,8.89,10.00,9.41,9.37,0.21
3,3,,Sweden,Full democracy,9.58,9.29,8.33,10.00,9.12,9.26,0.13
4,4,,New Zealand,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25,0.01
...,...,...,...,...,...,...,...,...,...,...,...
166,163,,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.35,1.55,0.06
167,164,,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43,
168,165,,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32,
169,166,,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13,


In [14]:
demo=demos[4]
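
The position of the table in the list (4 here) can shift whenever the Wikipedia page is edited. As a minimal sketch, assuming we know some column names the table must have, a small helper can pick the table by its columns instead of by position. The helper `pick_table` and the toy tables are hypothetical and only illustrate the idea:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first data frame whose columns include every required name."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError("no table has columns: " + ", ".join(required_columns))

# Toy stand-ins for the list that pd.read_html returns (hypothetical data).
tables = [
    pd.DataFrame({"Year": [2019, 2020]}),
    pd.DataFrame({"Country": ["Norway", "Chad"],
                  "Regime type": ["Full democracy", "Authoritarian"]}),
]

picked = pick_table(tables, ["Country", "Regime type"])
print(picked.shape)
```

With the real data the call would look like `pick_table(demos, ['Country', 'Regime type'])`, assuming those names are unique to the table of interest.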

### - Cleaning

In [15]:
demo.columns

Index(['Rank', 'Δ Rank', 'Country', 'Regime type',
       'Elec­toral pro­cess and plura­lism', 'Func­tioning of govern­ment',
       'Poli­tical partici­pation', 'Poli­tical cul­ture', 'Civil liber­ties',
       'Overall score', 'Δ Score'],
      dtype='object')

Keep some columns:

In [16]:
demo=demo.iloc[:,2:-1] #from the third up to the penultimate

In [19]:
#check columns as a list
demo.columns.to_list()

['Country',
 'Regime type',
 'Elec\xadtoral pro\xadcess and plura\xadlism',
 'Func\xadtioning of govern\xadment',
 'Poli\xadtical partici\xadpation',
 'Poli\xadtical cul\xadture',
 'Civil liber\xadties',
 'Overall score']

Get rid of runs of one or more spaces (\s+) and of characters that are not alphanumeric (\W), which includes the invisible soft hyphens (\xad) shown above:

In [21]:
demo.columns.str.replace(r'\W|\s+',"",regex=True) # raw string, so Python does not try to interpret \W and \s

Index(['Country', 'Regimetype', 'Electoralprocessandpluralism',
       'Functioningofgovernment', 'Politicalparticipation', 'Politicalculture',
       'Civilliberties', 'Overallscore'],
      dtype='object')

Replacing:

In [22]:
# Simplifying column names to facilitate further work
# (raw string, so Python does not try to interpret \W and \s):
demo.columns=demo.columns.str.replace(r'\W|\s+',"",regex=True)
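
To see what the pattern does, here is a small self-contained sketch on a hand-made index that mimics the column names above (the names in `raw` are copied from the output, the rest is illustrative). It also suggests that `\W+` alone may be enough, since whitespace is already non-alphanumeric:

```python
import pandas as pd

raw = pd.Index(["Country", "Regime type",
                "Elec\xadtoral pro\xadcess and plura\xadlism", "Overall score"])

# \W matches anything that is not a letter, digit or underscore; that already
# covers whitespace and the soft hyphen (\xad), so r"\W+" alone should suffice.
clean = raw.str.replace(r"\W+", "", regex=True)

# The pattern used in the notebook, for comparison.
same = raw.str.replace(r"\W|\s+", "", regex=True)

print(clean.to_list())
print(clean.equals(same))
```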

Check whether the categorical column has unexpected values:

In [24]:
# frequency of each regime type
demo.Regimetype.value_counts()

Authoritarian            57
Flawed democracy         52
Hybrid regime            35
Full democracy           23
Authoritarian regimes     1
Hybrid regimes            1
Full democracies          1
Flawed democracies        1
Name: Regimetype, dtype: int64

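The last four values are not valid regime types: they come from the section-header rows that the Wikipedia table repeats across every column (as in row 0 above). Since they are not among the valid levels, those cells will become missing values during formatting.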
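
A quick way to look at the offending rows is to filter on the values that are not in the list of valid levels. This is a sketch on toy data; the list `valid` is taken from the counts above and the data frame `toy` is hypothetical:

```python
import pandas as pd

valid = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]

# Toy data imitating the header rows mixed in with real countries.
toy = pd.DataFrame({
    "Country": ["Full democracies", "Norway", "Hybrid regimes", "Chad"],
    "Regimetype": ["Full democracies", "Full democracy", "Hybrid regimes", "Authoritarian"],
})

# Rows whose regime type is NOT one of the valid levels.
unexpected = toy[~toy.Regimetype.isin(valid)]
print(unexpected)
```

In this toy case the flagged rows repeat the same text in every column, which is the signature of a header row rather than a country.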

### - Formatting

In [25]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Country                       171 non-null    object
 1   Regimetype                    171 non-null    object
 2   Electoralprocessandpluralism  171 non-null    object
 3   Functioningofgovernment       171 non-null    object
 4   Politicalparticipation        171 non-null    object
 5   Politicalculture              171 non-null    object
 6   Civilliberties                171 non-null    object
 7   Overallscore                  171 non-null    object
dtypes: object(8)
memory usage: 10.8+ KB


Formatting the ordinal variable. Creating the levels:

In [27]:
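from pandas.api.types import CategoricalDtype

# the four valid levels, from least to most democratic (ordering assumed here)
correctLevels=['Authoritarian','Hybrid regime','Flawed democracy','Full democracy']
# create data type
levelsRegime = CategoricalDtype(categories=correctLevels, ordered=True)
# make the change (values that are not in correctLevels become missing):
demo.Regimetype=demo.Regimetype.astype(levelsRegime)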
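
The following sketch shows, on a toy series, what the conversion does: values outside the declared levels turn into missing values, and the ordering makes `min`, `max` and comparisons meaningful. The level order used here (least to most democratic) is an assumption, and `raw` is hypothetical data:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed order: from least to most democratic.
levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]
regime_type = CategoricalDtype(categories=levels, ordered=True)

# "Full democracies" is a header value, not a valid level.
raw = pd.Series(["Full democracies", "Full democracy", "Hybrid regime", "Authoritarian"])
regime = raw.astype(regime_type)

print(regime.isna().sum())           # how many values were not valid levels
print(regime.min(), regime.max())    # lowest and highest level present
print((regime >= "Hybrid regime").to_list())
```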

Formatting the numerical columns:

In [28]:
demo.iloc[:,2:]

Unnamed: 0,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Overallscore
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,10.00,9.64,10.00,10.00,9.41,9.81
2,10.00,8.57,8.89,10.00,9.41,9.37
3,9.58,9.29,8.33,10.00,9.12,9.26
4,10.00,8.93,8.89,8.75,9.71,9.25
...,...,...,...,...,...,...
166,0.00,0.00,1.67,3.75,2.35,1.55
167,0.00,0.00,2.78,4.38,0.00,1.43
168,1.25,0.00,1.11,1.88,2.35,1.32
169,0.00,0.00,1.67,3.13,0.88,1.13


In [31]:
demo.iloc[:,2:].apply(pd.to_numeric,errors='coerce')

Unnamed: 0,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Overallscore
0,,,,,,
1,10.00,9.64,10.00,10.00,9.41,9.81
2,10.00,8.57,8.89,10.00,9.41,9.37
3,9.58,9.29,8.33,10.00,9.12,9.26
4,10.00,8.93,8.89,8.75,9.71,9.25
...,...,...,...,...,...,...
166,0.00,0.00,1.67,3.75,2.35,1.55
167,0.00,0.00,2.78,4.38,0.00,1.43
168,1.25,0.00,1.11,1.88,2.35,1.32
169,0.00,0.00,1.67,3.13,0.88,1.13


In [37]:
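#making the change (assigning by column name replaces the columns,
#so the new float dtype is kept in recent pandas versions too)
numCols=demo.columns[2:]
demo[numCols]=demo[numCols].apply(pd.to_numeric,errors='coerce')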
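
As a hedged illustration of the step above, the toy example below converts text columns with `errors='coerce'` and assigns the result back by column name. Assigning by name replaces the columns, which should keep the float dtype regardless of the pandas version, whereas assigning through `iloc` may leave the columns as `object` in pandas 2.0 or later. The data frame `toy` is hypothetical:

```python
import pandas as pd

# Toy data: the first row imitates a header row repeated across columns.
toy = pd.DataFrame({
    "Country": ["Full democracies", "Norway", "Chad"],
    "Overallscore": ["Full democracies", "9.81", "1.55"],
    "Civilliberties": ["Full democracies", "9.41", "2.35"],
})

num_cols = toy.columns[1:]  # every column except Country

# Text that cannot be parsed as a number becomes NaN instead of raising an error.
toy[num_cols] = toy[num_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy)
```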

Verifying:

In [38]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Country                       171 non-null    object  
 1   Regimetype                    167 non-null    category
 2   Electoralprocessandpluralism  167 non-null    float64 
 3   Functioningofgovernment       167 non-null    float64 
 4   Politicalparticipation        167 non-null    float64 
 5   Politicalculture              167 non-null    float64 
 6   Civilliberties                167 non-null    float64 
 7   Overallscore                  167 non-null    float64 
dtypes: category(1), float64(6), object(1)
memory usage: 9.8+ KB


Get rid of the rows with missing values (the four header rows) and renumber the index:

In [41]:
demo.dropna().reset_index(drop=True)

Unnamed: 0,Country,Regimetype,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Overallscore
0,Norway,Full democracy,10.00,9.64,10.00,10.00,9.41,9.81
1,Iceland,Full democracy,10.00,8.57,8.89,10.00,9.41,9.37
2,Sweden,Full democracy,9.58,9.29,8.33,10.00,9.12,9.26
3,New Zealand,Full democracy,10.00,8.93,8.89,8.75,9.71,9.25
4,Canada,Full democracy,9.58,8.93,8.89,9.38,9.41,9.24
...,...,...,...,...,...,...,...,...
162,Chad,Authoritarian,0.00,0.00,1.67,3.75,2.35,1.55
163,Syria,Authoritarian,0.00,0.00,2.78,4.38,0.00,1.43
164,Central African Republic,Authoritarian,1.25,0.00,1.11,1.88,2.35,1.32
165,Democratic Republic of the Congo,Authoritarian,0.00,0.00,1.67,3.13,0.88,1.13


In [42]:
finaldemo=demo.dropna().reset_index(drop=True)
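
A minimal sketch of what `dropna` followed by `reset_index(drop=True)` does, using hypothetical toy data: rows with any missing value are removed, and the index is renumbered from zero without keeping the old index as a column.

```python
import pandas as pd

# Toy data: header rows already turned into missing scores.
toy = pd.DataFrame({
    "Country": ["Full democracies", "Norway", "Hybrid regimes", "Chad"],
    "Overallscore": [None, 9.81, None, 1.55],
})

# Drop rows with any missing value, then renumber the rows 0, 1, ...
clean = toy.dropna().reset_index(drop=True)
print(clean)
```

If only some columns should decide which rows are dropped, `dropna` also accepts a `subset` argument with the column names to check.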

In [43]:
finaldemo.to_csv('finaldemo.csv',index=False)
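
Keep in mind that a CSV file stores only text, so the ordered categorical type is not saved. The sketch below, on hypothetical toy data and writing to an in-memory buffer instead of a file, shows the type being lost on reading and one way to restore it. The level order is the same assumption as before:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed order: from least to most democratic.
levels = ["Authoritarian", "Hybrid regime", "Flawed democracy", "Full democracy"]
regime_type = CategoricalDtype(categories=levels, ordered=True)

toy = pd.DataFrame({
    "Country": ["Norway", "Chad"],
    "Regimetype": ["Full democracy", "Authoritarian"],
    "Overallscore": [9.81, 1.55],
})
toy["Regimetype"] = toy["Regimetype"].astype(regime_type)

# Write to an in-memory buffer (stands in for 'finaldemo.csv').
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

back = pd.read_csv(buffer)
dtype_after_read = back["Regimetype"].dtype   # no longer categorical
print(dtype_after_read)

# Restore the ordered categorical type after reading.
back["Regimetype"] = back["Regimetype"].astype(regime_type)
print(back.dtypes)
```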