# Case Study: CPI... not Consumer Price Index

### Read Below!

**CPI** is an acronym that stands for **Corruption Perceptions Index.** The CPI ranks countries "by their perceived levels of corruption, as determined by expert assessments and opinion surveys." This ranking is based upon the idea of corruption as "the misuse of public power for private benefit."

## Part 1: Let's Add Some Imports

### Run the code below...

If you read this text... add a comment to your code saying "I read instructions"

In [None]:
import pandas as pd
import numpy as np
import statistics as st
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')

## Part 2: Let's read file data

We have a file named `'cpiCaseStudyData.csv'`, which should be located in the `data` folder in your `dssg-jupyter` directory. Thus...

### Create a variable called 'filename' containing the filepath '../Data/cpiCaseStudyData.csv'

### Create a DataFrame called 'data' using the read_csv command and filename

Check your "Working with DataFrames" reference sheet if you are having trouble!
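As a rough illustration of the `read_csv` pattern (not the answer to this step), the sketch below reads a tiny made-up CSV from a string instead of a file. The names `sample_csv` and `sample`, and every value in the table, are invented for the example; the real file may look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; the values are made up for illustration.
sample_csv = io.StringIO(
    "Jurisdiction,2014,2015\n"
    "Examplia,55,-\n"
    "Samplestan,30,32\n"
)

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(sample_csv)

print(sample.head())   # first rows only, handy for long DataFrames
print(sample.shape)    # (number of rows, number of columns)
```

With a real file you would pass the filepath string to `read_csv` instead of the `StringIO` object.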

### Display your DataFrame 

This DataFrame is very long... so we don't display all of the values when we print it. 

### What years are included in the data set?

### What is the largest recorded cpi in 1998? Which jurisdiction/country has this cpi?

### What is the largest recorded cpi in 2015? Which jurisdiction/country has this cpi?
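One possible way to find the largest value in a column is sketched below on made-up data. It assumes that missing scores are stored as the text `'-'`, which is what the helper code later in this notebook checks for. The DataFrame `toy` and its jurisdictions are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; '-' marks a missing score.
toy = pd.DataFrame({
    "Jurisdiction": ["Examplia", "Samplestan", "Testland"],
    "2015": ["55", "-", "9"],
})

# Turn text into numbers; anything that is not a number (like '-') becomes NaN.
scores = pd.to_numeric(toy["2015"], errors="coerce")

largest = scores.max()                              # NaN values are skipped
winner = toy.loc[scores.idxmax(), "Jurisdiction"]   # row label of the largest score
print(largest, winner)

# Comparing the raw text instead compares character by character, not by size.
print(max(toy["2015"]))
```

Converting to numbers first matters: as text, `'9'` sorts above `'55'`, so taking the maximum of the unconverted column can give a misleading answer.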

## Part 3: There's this Code...

### Subsection 3.1 Run the Code Below...

In [None]:
data.index = data['Jurisdiction']
# This procedure takes a DataFrame df and a str year and returns a new DataFrame ret
# The rows in df that had a blank entry ('-') in the year column are absent from ret
def trimNaByYear(df, year):
    ret = df.copy()  # copy so that later changes to ret never alter df
    for index in range(len(df[year])):
        cur = df[year].iloc[index]  # .iloc looks up by position, since the index holds names
        if(cur == '-'):
            ret = ret.drop(df['Jurisdiction'].iloc[index])
    return ret

# This procedure takes a DataFrame df, a str year, and two numeric types lowestRank and highestRank
# The rows in df whose entries were lower than lowestRank or higher than highestRank are absent from ret
def trimRankingsByYear(df, year, lowestRank, highestRank):
    ret = df
    for index in range(len(df[year])):
        cur = float(df[year].iloc[index])  # .iloc looks up by position
        if(cur < lowestRank or cur > highestRank):
            ret = ret.drop(df['Jurisdiction'].iloc[index])
    return ret

# This procedure takes a str year and displays two sorted bar graphs
# The bar graphs displayed contain the cpi rankings of Jurisdictions for the given year
# One graph shows the rankings between zero and the median of the recorded values of the given year
# The other graph shows the remaining rankings of the given year
def cpiBarByYearSorted(year):
    dfTrimmedNa = trimNaByYear(data, year)
    dfTrimmedNa[year] = dfTrimmedNa[year].astype(float)
    trimmedMed = st.median(dfTrimmedNa[year])
    dfTrimmedSorted = dfTrimmedNa.sort_values(year, ascending=True)
    trimRankingsByYear(dfTrimmedSorted, year, 0, trimmedMed).plot.bar(figsize=(20,10))
    trimRankingsByYear(dfTrimmedSorted, year, trimmedMed, max(dfTrimmedNa[year]) + 1).plot.bar(figsize=(20,10))

### Let's use the code above to explore our data

**Run the Code below...** 

In [None]:
cpiBarByYearSorted('2010')

### Discuss the following with your partner.... 

- **What did the code do?**
- **What do the graphs show?** 
- **Is it better to have a smaller or bigger cpi?**
- **Which country was the most corrupt in 2010?**
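To help with the first question, here is a compact sketch of the same steps on made-up data: drop the missing rows, convert to numbers, sort, and split around the median. It uses pandas filtering instead of the loops above, so it is an alternative way of writing the idea rather than the notebook's own code, and the jurisdictions and scores are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; '-' marks a missing score.
toy = pd.DataFrame({
    "Jurisdiction": ["Examplia", "Samplestan", "Testland", "Demoville"],
    "2010": ["7.5", "-", "2.1", "4.0"],
})
toy.index = toy["Jurisdiction"]

# Step 1: drop rows with no score for the year (the job of trimNaByYear).
kept = toy[toy["2010"] != "-"].copy()

# Step 2: convert to numbers, sort from smallest to largest, find the median.
kept["2010"] = kept["2010"].astype(float)
kept = kept.sort_values("2010", ascending=True)
median = kept["2010"].median()

# Step 3: split into the two groups that become the two bar graphs.
# A row sitting exactly on the median lands in both groups.
lower_half = kept[kept["2010"] <= median]
upper_half = kept[kept["2010"] >= median]
print(list(lower_half.index))
print(list(upper_half.index))
```

Calling `.plot.bar()` on each half would then draw one graph per group.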

### Could you use the procedure above to look at cpi rankings from 2015?

Write code below to display bar graphs displaying the cpi rankings from 2015. 

### Optional Extension Questions:

- **Are cpi rankings smaller or bigger in more recent years?**
- **Are there more ranked countries in 2014 or 2015?**

### Subsection 3.2 Run the code below...

In [None]:
# This procedure takes two str year1 and year2 and displays two sorted bar graphs
# The bar graphs show the cpi rankings of Jurisdictions for both years side by side
# Only Jurisdictions with a recorded value in both years are included, sorted by year1
# One graph shows the rankings between zero and the median of the year1 values
# The other graph shows the remaining rankings
def cpiBarComparison(year1, year2):
    dfTrimmedNa = trimNaByYear(trimNaByYear(data, year1), year2)
    dfTrimmedNa[year1] = dfTrimmedNa[year1].astype(float)
    dfTrimmedNa[year2] = dfTrimmedNa[year2].astype(float)
    trimmedMed = st.median(dfTrimmedNa[year1])
    dfTrimmedSorted = dfTrimmedNa.sort_values(year1, ascending=True)
    trimRankingsByYear(dfTrimmedSorted, year1, 0, trimmedMed).plot.bar(figsize=(20,5))
    trimRankingsByYear(dfTrimmedSorted, year1, trimmedMed, max(dfTrimmedNa[year1]) + 1).plot.bar(figsize=(20,10))

### Let's use the code above to learn more about our data!

**Run the code below...**

In [None]:
cpiBarComparison('2003', '2013')

### Discuss with your partner...

- **Why display two years of data side by side?**
- **What changed from 2003 to 2013?**

### Display a bar graph with data from 2014 and 2015

Write your code below.

### Do your graphs change if you change the order of parameters in your code?

**Display a bar graph with data from 2015 and 2014** to find out

### Subsection 3.3 Run the code below...

In [None]:
# HELPFUL LIST: used for identifying the available years in our data
years = [1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]

# This procedure takes a list jurisdictionLst
# All '-' elements are replaced with 0.0 and all other values are cast to floats
def dfByJurisdictionToFlt(jurisdictionLst):
    for index in range(len(jurisdictionLst)):
        if(jurisdictionLst[index] == '-'):
            jurisdictionLst[index] = 0.0
        else:
            jurisdictionLst[index] = float(jurisdictionLst[index])
    return jurisdictionLst

# This procedure takes a str jurisdiction and displays a bar graph
# The bar graph displays the available cpi rankings of jurisdiction
def cpiBarByJurisdiction(jurisdiction):
    countrycpi = data.loc[jurisdiction].tolist()[1:] # the [1:] text removes the first element of the list
    countrycpiflt = dfByJurisdictionToFlt(countrycpi)
    pd.DataFrame(index=years, data={'cpi': countrycpiflt}).plot.bar(figsize=(20,5))

### Let's use the code above to look at each jurisdiction/country separately

**Run the code below...**

In [None]:
cpiBarByJurisdiction('China')

### How would you write code to display the cpi for Russia?

Write your code below.

*Hint: Look at the code above*

### Subsection 3.4 Run the code below...

In [None]:
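# This procedure returns a list of the highest cpi for each year of recorded data
def calculateHighestCpiByYear():
    cols = list(data.columns)[1:]
    cpiMaxes = []
    for col in cols:
        # Convert to floats before taking the max: max() on text compares character by character
        recorded = [float(val) for val in data[col].tolist() if val != '-']
        cpiMaxes.append(max(recorded))
    return cpiMaxes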

# HELPFUL LIST: used for calculating relative rankings by year
highestCpiByYear = calculateHighestCpiByYear()

# This procedure takes a str jurisdiction and displays a bar graph
# The bar graph displays the available cpi rankings of jurisdiction out of 1.0
# A low value shows a low ranking compared to available data
# A similar statement can be made about a high ranking
def cpiAvgBarByJurisdiction(jurisdiction):
    countrycpi = data.loc[jurisdiction].tolist()[1:] # the [1:] text removes the first element of the list
    countrycpiflt = dfByJurisdictionToFlt(countrycpi)
    avgCountryCpi = np.divide(countrycpiflt, highestCpiByYear)
    pd.DataFrame(index=years, data={'cpi': avgCountryCpi}).plot.bar(figsize=(20,5))

### Let's look at how corrupt each jurisdiction/country is compared to the rest of the recorded jurisdictions/countries

**Run the code below...**

In [None]:
cpiAvgBarByJurisdiction('China')

### Discuss... 

- **Is the graph produced by this code different from the previous graph we made for China?**
- **If yes, how is it different?**

Be prepared to share your answers with a counselor.

### Run the code below...

In [None]:
cpiAvgBarByJurisdiction('United States')

### Let's figure out what this data shows us.

- **The y-axis has values from 0.0 to 1.0.**
- **A value closer to 0 has a corresponding cpi that is lower than the rest of the recorded cpi's for that year**
- **A value closer to 1 has a corresponding cpi that is higher than the rest of the recorded cpi's for that year**
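
The arithmetic behind these values can be sketched with a few made-up numbers: each score is divided by the highest score recorded in the same year. The lists `country_scores` and `highest_scores` below are hypothetical and are not taken from the data set.

```python
import numpy as np

# Hypothetical scores for one jurisdiction across three years (0.0 stands for "no data").
country_scores = [3.0, 0.0, 40.0]
# Hypothetical highest score recorded by any jurisdiction in each of those years.
highest_scores = [10.0, 10.0, 80.0]

# Element-by-element division: each year's score over that year's highest score.
relative = np.divide(country_scores, highest_scores)
print(relative)
```

Because every year is divided by its own maximum, the results stay between 0 and 1 even when the raw scores in different years are on different scales.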


### From the text above...

### Is the US perceived as more or less corrupt than other countries?

Write your answer below.

## Part 4: Exploring the Data

Let's see what we can find.

In [None]:
# Space for exploration

### Here are some example questions...

- **What is the lowest cpi ranking in 2015?**
- **What is the highest cpi ranking in 2015?**
- **Which Jurisdictions consistently have the highest or lowest rankings?**
- **Which Jurisdiction has the mean cpi? What does it mean to have the "mean cpi"?**
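
If you are unsure where to start, the sketch below shows one way such questions could be approached on made-up data. It reads "has the mean cpi" as "has the score closest to the mean", which is only one possible interpretation, and the jurisdictions and scores are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; '-' marks a missing score.
toy = pd.DataFrame(
    {"2015": ["80", "-", "20", "55"]},
    index=["Examplia", "Samplestan", "Testland", "Demoville"],
)

# Convert text to numbers and drop the missing entries.
scores = pd.to_numeric(toy["2015"], errors="coerce").dropna()

lowest_name, lowest = scores.idxmin(), scores.min()
highest_name, highest = scores.idxmax(), scores.max()

# The mean is rarely an actual score, so look for the closest recorded one.
mean_score = scores.mean()
closest = (scores - mean_score).abs().idxmin()

print(lowest_name, lowest)
print(highest_name, highest)
print(mean_score, closest)
```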

### See a trend... Want to know why it exists? Google it! 

Alternatively you could ask a counselor...

In [None]:
# More space for exploration

## Extension: Whatzitdo?

Find out what the following code does...

In [None]:
def arbitrary(lsty):
    for burger in range(len(lsty)):
        if(lsty[burger] == '-'):
            lsty[burger] = 0.0
        else:
            lsty[burger] = float(lsty[burger])
    return lsty

**Hint: try running the code on some inputs to see what it takes**

*Example:*  
    values = ['2', '3', '-']  
    print(arbitrary(values))  