<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Data Preprocessing in Python: Data formatting

# <font color="red">Case 1: CIA World Factbook</font>

The CIA has an interesting website:
https://www.cia.gov/library/publications/resources/the-world-factbook/

In this case, I will use the report of millions of megatons of carbon dioxide emitted per country available here:
https://www.cia.gov/library/publications/resources/the-world-factbook/fields/274.html



Let me first **collect** the data:

In [None]:
import pandas as pd

link1="https://www.cia.gov/library/publications/resources/the-world-factbook/fields/274.html"

dataco2=pd.read_html(link1,header=0,attrs={'id': 'fieldListing'})

Verify the type of the result and how many elements the list contains:

In [None]:
type(dataco2), len(dataco2) 

Recover data frame from the list:

In [None]:
cia=dataco2[0]

## The cleaning process

* __Checking first rows__ to see if header is in place:

In [None]:
cia.head()

* __Simplifying column names__ to facilitate further work:

In [None]:
# current columns:
cia.columns

In [None]:
# creating dictionary of changes:
OldToNew={cia.columns[0]:'countries',
          cia.columns[1]:'co2'}

In [None]:
# making change happen:
cia.rename(columns=OldToNew,inplace=True)

In [None]:
# current situation
cia.head()

* __Checking last rows__:

In [None]:
cia.tail()

* __Checking cell values__ to see if each cell has the right value:

**a**. Checking "countries": 

Making sure there are no leading or trailing blanks in the country names; this is a _preventive_ measure:

In [None]:
cia.countries=cia.countries.str.strip()
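
As a minimal sketch of what `str.strip()` does, here is a toy series with invented names (not the CIA data). It removes leading and trailing whitespace and leaves inner spaces alone:

```python
import pandas as pd

# toy values, invented for illustration
names = pd.Series([' Peru', 'Chile ', 'Costa Rica'])

# how many cells would change if stripped?
needs_cleaning = (names != names.str.strip()).sum()

# the cleaned version; 'Costa Rica' keeps its inner space
cleaned = names.str.strip()
print(needs_cleaning)
print(cleaned.tolist())
```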

**b**. Checking second column:

**b.1**. Splitting each cell using a particular string of characters, so that only the number and the unit remain.

In [None]:
# first look (notice the blank space before Mt)

cia.co2.str.split(pat=' Mt')

In [None]:
# improving first look: "expand" separates into columns

cia.co2.str.split(' Mt',expand=True)

In [None]:
# keeping the first element of the last result:

cia.co2.str.split(' Mt',expand=True)[0]

In [None]:
# Notice that the previous steps have NOT made any changes; I have only displayed the results. 
# Now I will save the result, so I can replace the column:

result1=cia.co2.str.split(' Mt',expand=True)[0]

In [None]:
# assign can create or overwrite a column. Then, I use 'result1' here
cia=cia.assign(co2=result1)

In [None]:
# Current situation:
cia
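
The split-then-assign steps above can be sketched on a small toy frame. The cell values below are invented to resemble the format discussed here; they are not taken from the CIA page:

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({'countries': ['A', 'B', 'C'],
                    'co2': ['11.67 billion Mt (2017 est.)',
                            '424 million Mt (2017 est.)',
                            '7,000 Mt (2017 est.)']})

# expand=True returns a data frame: column 0 holds the text before ' Mt'
pieces = toy.co2.str.split(' Mt', expand=True)

# overwrite the column with the first piece
toy = toy.assign(co2=pieces[0])
print(toy)
```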

**b.2.** Keep numeric value:

In [None]:
# \d+  one or more digits
# \,*  zero or more commas
# \.*  zero or more dots
# \d*  zero or more digits

cia.co2.str.extract(r'(\d+\,*\.*\d*)') # Notice the parentheses: they signal a *group* for pandas

**b.3.** Keep string representing unit:

In [None]:
# a sequence of non-digits after a space:
# \s before (\D+)
cia.co2.str.extract(r'\s(\D+)')

In [None]:
## NOTE: Steps **b.2** and **b.3** can be done at once:

# simultaneously
cia.co2.str.extract(r'(\d+\,*\.*\d*)\s(\D+)') # Notice row indexes 3 and 211

In [None]:
# Solving the previous issue by making the space and the second group optional (using *).

cia.co2.str.extract(r'(\d+\,?\.?\d*)\s*(\D+)*')

In [None]:
# Pandas can give a **name** to each group with (?P<name>...):

cia.co2.str.extract(r'(?P<number>\d+\,*\.*\d*)\s*(?P<text>\D+)*')
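
To see what each named group captures, the same pattern can be tried with Python's `re` module on a few strings. This is only a sketch: the test strings are invented, and `capture` is a hypothetical helper name.

```python
import re

# same pattern as above, written as a raw string
pattern = re.compile(r'(?P<number>\d+\,*\.*\d*)\s*(?P<text>\D+)*')

def capture(cell):
    # hypothetical helper: returns (number, text) for the first match
    match = pattern.search(cell)
    return match.group('number'), match.group('text')

# invented test strings
for cell in ['11.67 billion', '424 million', '7,000', '1,234.5']:
    print(cell, '->', capture(cell))
```

Note the last string: the pattern allows a comma part or a decimal part after the first digits, but not both, so `'1,234.5'` yields `'1,234'` as the number. Whether such values occur in the CIA table is not checked here.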

In [None]:
# Notice the result is a data frame; let's save it:
result2=cia.co2.str.extract(r'(?P<number>\d+\,*\.*\d*)\s*(?P<text>\D+)*')

In [None]:
# And let's use the columns of this new data frame:
cia=cia.assign(value=result2.number,
               unit=result2.text)

In [None]:
# Current situation:
cia.head()

**b.4.** Delete symbols in numeric data that could be troublesome in future operations:

In [None]:
# the numbers have commas, let's get rid of those:
cia.value=cia.value.str.replace(",","")

**b.5.** Replace the text in the **unit** column with numbers:

In [None]:
# Check what you have:
cia.unit.value_counts(dropna=False)

In [None]:
# create dictionary for replacements:
replacements={'million': 10**6, "billion": 10**9,None:10**0}

In [None]:
# take a look at the result:
cia.unit.replace(replacements)

In [None]:
# make it happen (assigning the result back; 'inplace' on a single column may not update the data frame)
cia=cia.assign(unit=cia.unit.replace(replacements))
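
An alternative sketch of the same idea, on invented values, uses `map` for the known words and `fillna` for the cells with no unit. This is a different technique from the `replace` call above, shown only as an option in case a `None` key in the dictionary causes trouble:

```python
import pandas as pd

# toy unit column, invented for illustration; None means "no unit word"
unit = pd.Series(['billion', 'million', None])

# only the known words go in the dictionary
multipliers = {'million': 10**6, 'billion': 10**9}

# words not in the dictionary (and missing values) become NaN, then 1
factor = unit.map(multipliers).fillna(1)
print(factor.tolist())
```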

In [None]:
#Current situation:
cia.head()

**b.6**. Some housekeeping: We do not need the old CO2 column anymore:

In [None]:
# when using 'columns=' or 'index=', axis not needed
# when using 'labels' axis is needed
cia.drop(columns='co2',inplace=True) 

## FORMATTING

### Formatting numeric columns

Formatting makes sure the data can go into statistical work. So the first step is to detect the data types:

In [None]:
cia.dtypes

If you request statistics, you only get:

In [None]:
cia.describe()

The column _unit_ is already numeric, because we created it that way during the cleaning process. However, the column _value_ is still text. We can turn it into a numeric one like this:

In [None]:
pd.to_numeric(cia.value)

Then, let's turn the previous result into a real change:

In [None]:
cia=cia.assign(value=pd.to_numeric(cia.value))

This should look the same as before:

In [None]:
cia.head()

But, it is different:

In [None]:
cia.dtypes
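
A minimal sketch of the conversion on invented values. The second part assumes a cell that cannot be parsed; `errors='coerce'` then turns it into a missing value instead of raising an error:

```python
import pandas as pd

# toy text column, invented for illustration
value = pd.Series(['11.67', '424', '7000'])
nums = pd.to_numeric(value)
print(nums.dtype)

# with an unparseable cell, errors='coerce' yields NaN
messy = pd.Series(['5', 'n/a'])
coerced = pd.to_numeric(messy, errors='coerce')
print(coerced.isna().tolist())
```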

Then, you may get more statistics:

In [None]:
cia.describe()

We can now multiply both columns, as each has numbers:

In [None]:
#previous result:
cia.value*cia.unit

That result should be our new CO2:

In [None]:
cia=cia.assign(co2_in_MT=cia.value*cia.unit)

In [None]:
# current situation:
cia.head()

Let's get rid of the second and third columns (_value_ and _unit_):

In [None]:
# you want this:
cia.drop(columns=['value','unit'])

Let's make the changes:

In [None]:
cia.drop(columns=['value','unit'],inplace=True)

#### The **cia** data frame is clean and formatted.

_______


# <font color="red">Case 2: Democracy Index from Wikipedia</font>

In [None]:
demoLink = "https://en.wikipedia.org/wiki/Democracy_Index" 

# getting the data frame in one step:
demodex=pd.read_html(demoLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]

## The cleaning process

* __Checking first rows__ to see if header is in place:

In [None]:
demodex.head(10)

* __Checking last rows__:

In [None]:
demodex.tail(10)

The last row must go; let me erase the **Rank** and **Score** columns at the same time:

In [None]:
#bye row 167, Rank and Score
demodex=demodex.drop(index=167,columns=['Rank','Score'])

* __Simplifying column names__ to facilitate further work:

In [None]:
demodex.columns

In [None]:
pattern=r'\s+'
replacement=""
# regex=True is needed so the pattern is read as a regular expression
demodex.columns=demodex.columns.str.replace(pattern,replacement,regex=True)
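
A small sketch of the renaming on an empty toy frame; the column names are illustrative. In recent pandas versions `str.replace` reads the pattern literally unless `regex=True` is passed, so the flag is given explicitly:

```python
import pandas as pd

# empty toy frame; column names are illustrative
toy = pd.DataFrame(columns=['Electoral process and pluralism',
                            'Regime type', 'Continent'])

# remove every run of whitespace from the column names
toy.columns = toy.columns.str.replace(r'\s+', '', regex=True)
print(toy.columns.tolist())
```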

In [None]:
# current situation:
demodex

* __Checking cell values__ to see if each cell has the right value:

**a.** Let me see if we have some strange value in the numeric columns:

In [None]:
# this is a preventive step!!
badSymbols=[]
NumericColNames=demodex.iloc[:,1:6].columns
for columnName in NumericColNames:
    for cell in demodex[columnName]:
        try:
            float(cell)
        except (ValueError, TypeError):
            if cell not in badSymbols:
                badSymbols.append(cell)
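
To see the loop catch something, here is the same logic wrapped in a function and run on a toy frame. The function name `find_non_numeric` and the data are hypothetical:

```python
import pandas as pd

def find_non_numeric(frame, column_names):
    # collect each distinct cell value that float() cannot convert
    bad = []
    for name in column_names:
        for cell in frame[name]:
            try:
                float(cell)
            except (ValueError, TypeError):
                if cell not in bad:
                    bad.append(cell)
    return bad

# toy data, invented for illustration
toy = pd.DataFrame({'Country': ['A', 'B', 'C'],
                    'Index1': ['9.5', 'n/a', '7.1'],
                    'Index2': ['8.0', '6.2', '-']})

found = find_non_numeric(toy, ['Index1', 'Index2'])
print(found)
```

An empty list means every cell in those columns can be read as a number.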

This is a preventive cleaning:

In [None]:
import numpy as np
# notice use of loc; the result is assigned back so the change is kept
demodex.loc[:,NumericColNames]=demodex.loc[:,NumericColNames].replace(to_replace=badSymbols,value=np.nan)
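
A minimal sketch, on invented data, of replacing flagged symbols with missing values. The result is assigned back to the selected columns; calling `replace(..., inplace=True)` on a selection may only modify a temporary copy:

```python
import numpy as np
import pandas as pd

# toy data and flagged symbols, invented for illustration
toy = pd.DataFrame({'Country': ['A', 'B', 'C'],
                    'Index1': ['9.5', 'n/a', '7.1'],
                    'Index2': ['8.0', '6.2', '-']})
cols = ['Index1', 'Index2']
flagged = ['n/a', '-']

# replace the flagged symbols and keep the result
toy[cols] = toy[cols].replace(to_replace=flagged, value=np.nan)
print(toy.isna().sum())
```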

Since the list `badSymbols` is empty, the cell values of the numerical columns are clean. 

**b.** Let me see if we have some strange value in the categorical columns:

In [None]:
demodex.iloc[:,-2::].apply(set).to_list()

No problem there either. It looks good so far. Let's go to formatting.

## FORMATTING

In [None]:
# checking data types:
demodex.dtypes

### Formatting numeric columns

Above, we saw that some indices need to be converted to numeric. Let's follow these steps:

In [None]:
# save column names of the columns to change:
colsToChange=demodex.iloc[:,1:6].columns

In [None]:
# make changes NOT using iloc:
demodex[colsToChange]=demodex[colsToChange].apply(pd.to_numeric)

### Formatting categorical columns

The *Continent* is a **NOMINAL** column:

In [None]:
demodex.Continent=pd.Categorical(demodex.Continent)

The *Regimetype* is an **ORDINAL** column:

In [None]:
# check the levels:
pd.unique(demodex.Regimetype).tolist()

In [None]:
#rewrite the levels in order:
correctLevels=['Authoritarian', 'Hybrid regime', 'Flawed democracy','Full democracy']

In [None]:
#format as ordinal:
demodex.Regimetype=pd.Categorical(demodex.Regimetype,categories=correctLevels,ordered=True)

The data types have changed:

In [None]:
#then
demodex.dtypes

Regime type is a category, and an ordered one:

In [None]:
demodex.Regimetype
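
What the ordering buys you can be sketched with a toy series. The levels are the ones listed above; the three values are invented:

```python
import pandas as pd

levels = ['Authoritarian', 'Hybrid regime', 'Flawed democracy', 'Full democracy']

# toy values, invented for illustration
regime = pd.Series(pd.Categorical(['Full democracy', 'Authoritarian', 'Hybrid regime'],
                                  categories=levels, ordered=True))

# ordered categories support min/max and comparisons against a level
print(regime.min(), '|', regime.max())
print((regime >= 'Flawed democracy').tolist())

# integer codes follow the order given in 'levels'
print(regime.cat.codes.tolist())
```

A nominal column such as *Continent* would not allow these comparisons, since its categories have no order.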

#### The **democracy index** data frame is clean and formatted.