# Homework 1  | Pandas & Data Provenance
### Assigned Friday, 16 Feb 2018  /  Due Friday, 23 Feb 2018

#### Goal:
The purpose of this homework is to give you more experience working in Python and pandas, to heighten your critical acumen when evaluating claims using data, and to make you more sensitive to questions regarding data provenance.

## Exploring Data
Now let's get some data to explore. Alain Desrosieres argues that the central tension in histories examining the role of statistics in political discourse is that the statistical entities that statistics uses are both real and fabrications: real in that they must be taken as “uncontestable standards” of reference insofar as they serve as compelling evidence for a particular claim; fabrications in that they are the result of “the provisional and fragile crowning of a series of conventions of equivalence between entities.”[1] The statistical entity of life expectancy, for instance, is real insofar as it serves as a proxy for the health of populations and individuals, and is used to justify disparities in health and life insurance pricing and coverage for different populations. Yet in calculating life expectancy, one quickly discovers not a single computational method, but hundreds--each with a different set of assumptions that yield different results. Deciding which life expectancy estimation to employ is tied up with what the measure will be used to do, and so involves political, ethical, and even moral decisions about who and what should be counted and excluded.[2] Tracing the historical transformation of a statistical entity from a contingent, context-sensitive description into a “universal” property provides insight into the political institutions that created it, while also making legible the ways in which that entity exerts reciprocal pressure on the institutions and individuals that created it.[3] Exploring the political implications of statistical entities is further complicated by their historical tendency to be repurposed for use in new arguments. 
While life expectancy was first developed for assigning and categorizing individuals according to their likelihood of death while their life insurance policy was in force, this statistical category was subsequently put to more sinister purposes: namely, to “demonstrate” the existence of racial biological characteristics and then to serve as “evidence” that race was an appropriate category for screening immigrants.[4]

<b>*Our immediate purpose here is to get some practice using Pandas to explore a data set.*</b>


[1] Alain Desrosieres, <em>The Politics of Large Numbers: A History of Statistical Reasoning</em>, Cambridge, MA: Harvard University Press, 1998, 324-325.

[2] Desrosieres, <em>The Politics of Large Numbers</em>, 325.

[3] Desrosieres, <em>The Politics of Large Numbers</em>, 324.

[4] Dan Bouk, <em>How Our Days Became Numbered: Risk and the Rise of the Statistical Individual</em>, Chicago, IL: University of Chicago Press, 187-188, 201-202.

## Homework Problems 
#### This assignment is to be done on your own. Provide your code to justify your answer to each question.  Be sure to rename this homework so that it includes your name. Finally, please note that your code must run with the "life.expectancy.countries.csv" as originally provided to you. 

### Data Analysis Problems


1) Import the life.expectancy.countries.csv into a pandas dataframe entitled "lifeexpectancy". Rename the columns of the data frame using the list below entitled "column_names". Also be sure to drop the first row of your CSV using the following command: `lifeexpectancy.drop(lifeexpectancy.index[[0]])`. Note that `drop` returns a new dataframe rather than modifying the original, so assign the result back to `lifeexpectancy`.

In [237]:
import pandas

# Column names supplied by the assignment (no backslashes needed inside brackets)
column_names = ["country", "year", "life expectancy at birth (both sexes)",
                "life expectancy at birth (female)", "life expectancy at birth (male)",
                "life expectancy at age 60 (both sexes)", "life expectancy at age 60 (female)",
                "life expectancy at age 60 (male)"]

lifeexpectancy = pandas.read_csv("life.expectancy.countries.csv")
# drop() returns a new dataframe, so the result is assigned back
lifeexpectancy = lifeexpectancy.drop(lifeexpectancy.index[[0]])
lifeexpectancy.columns = column_names

lifeexpectancy

Unnamed: 0,country,year,life expectancy at birth (both sexes),life expectancy at birth (female),life expectancy at birth (male),life expectancy at age 60 (both sexes),life expectancy at age 60 (female),life expectancy at age 60 (male)
1,Afghanistan,2015,60.5,61.9,59.3,16.0,16.7,15.3
2,Afghanistan,2014,59.9,61.3,58.6,15.9,16.6,15.2
3,Afghanistan,2013,59.9,61.2,58.7,15.9,16.6,15.2
4,Afghanistan,2012,59.5,60.8,58.3,15.8,16.5,15.1
5,Afghanistan,2011,59.2,60.4,58.0,15.8,16.5,15.1
6,Afghanistan,2010,58.8,60.1,57.7,15.7,16.4,15.0
7,Afghanistan,2009,58.6,59.7,57.5,15.7,16.3,14.9
8,Afghanistan,2008,58.1,59.3,57.0,15.6,16.3,14.9
9,Afghanistan,2007,57.5,58.8,56.4,15.5,16.2,14.8
10,Afghanistan,2006,57.3,58.5,56.3,15.5,16.1,14.8


<b>Important</b>: For the current version of pandas, when you import "life.expectancy.countries.csv" into pandas in the usual manner, it sets all the life expectancy ages (i.e., columns 2 - 7) as "objects" instead of "floats". I'm not sure why it does this, but it will cause problems when you try to plot things. To fix this, be sure to run the following line of code once you've finished question 1 but before you begin question 2: 

In [238]:
lifeexpectancy.loc[:, 'life expectancy at birth (both sexes)':] = lifeexpectancy.loc[:, 'life expectancy at birth (both sexes)':].astype(float)
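
If the columns still report `object` after running the line above (whether an in-place `.loc[:, ...] = ...` assignment changes a column's dtype has varied across pandas versions), assigning the converted columns back by name is a more dependable alternative. A minimal sketch on a two-row stand-in frame; `toy` and `value_cols` are illustrative names, not part of the assignment:

```python
import pandas as pd

# Toy stand-in for the homework data: numbers read in as strings (dtype "object")
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "year": ["2015", "2014"],
    "life expectancy at birth (both sexes)": ["60.5", "59.9"],
    "life expectancy at age 60 (both sexes)": ["16.0", "15.9"],
})

# Every column from the first life-expectancy column onward
value_cols = toy.columns[2:]

# Assigning whole columns by name replaces them, so the new dtype sticks
toy[value_cols] = toy[value_cols].apply(pd.to_numeric)
toy["year"] = toy["year"].astype(int)

print(toy.dtypes)
```

Checking `lifeexpectancy.dtypes` after the conversion is a quick way to confirm that the life expectancy columns are now floats.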

2) How many different _countries_ do you have data for? How many different years of life expectancy data do you have for each country? Why do they provide life expectancy for people at birth AND at age 60 (i.e., what new insights does this offer)?

Answer:

- We have data for 194 unique countries
- Each country has a different number of years of life expectancy data (see below). Most countries have 16 years of data, but some have fewer (Andorra, for example, has only 1 year).
- Life expectancy is reported at birth AND at age 60 because remaining life expectancy changes the longer you live. The at-birth figure averages over everyone, including those who die young, while the age-60 figure counts only people who have already survived to 60. As a result, someone who has reached 60 can expect to live to an older age than the at-birth figure suggests, and the age-60 measure isolates mortality at older ages.
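
To make the last point concrete, here is a small arithmetic check using the 2015 Afghanistan row shown in the question 1 output (values copied from that table; the variable names are only illustrative):

```python
# 2015 Afghanistan values, copied from the table printed in question 1
at_birth = 60.5         # life expectancy at birth (both sexes)
remaining_at_60 = 16.0  # life expectancy at age 60 (both sexes)

# Expected age at death for someone who has already reached 60
expected_age_given_60 = 60 + remaining_at_60

print(expected_age_given_60)             # 76.0
print(expected_age_given_60 - at_birth)  # 15.5
```

The 15.5-year gap reflects deaths before age 60, which pull the at-birth figure down but no longer apply to someone who has reached 60.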

In [239]:
lifeexpectancy.describe() # Shows statistics for entire dataframe

Unnamed: 0,country,year,life expectancy at birth (both sexes),life expectancy at birth (female),life expectancy at birth (male),life expectancy at age 60 (both sexes),life expectancy at age 60 (female),life expectancy at age 60 (male)
count,2939,2939,2928.0,2928.0,2928.0,2939.0,2939.0,2939.0
unique,194,16,393.0,414.0,383.0,142.0,166.0,124.0
top,Spain,2013,73.0,77.4,70.5,16.3,17.0,15.5
freq,16,194,29.0,32.0,31.0,57.0,44.0,69.0


In [240]:
lifeexpectancy.groupby(['country'])['year'].describe() # Group by country and only look at statistics for "year"

Unnamed: 0_level_0,count,unique,top,freq
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Afghanistan,16,16,2013,1
Albania,16,16,2013,1
Algeria,16,16,2013,1
Andorra,1,1,2013,1
Angola,16,16,2013,1
Antigua and Barbuda,16,16,2013,1
Argentina,16,16,2013,1
Armenia,16,16,2013,1
Australia,16,16,2013,1
Austria,16,16,2013,1


3) Using pandas, make a new dataframe that contains all the data for Brazil. (Hint: the following code gives you a general idea of what you need to do: <code> dataframe[dataframe['column title']=='text_in_row']</code>.) 

In [242]:
brazil_expectancy = lifeexpectancy[lifeexpectancy['country'] == 'Brazil']
brazil_expectancy.info()  # Summary of the new dataframe (row count and column dtypes)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 354 to 369
Data columns (total 8 columns):
country                                   16 non-null object
year                                      16 non-null object
life expectancy at birth (both sexes)     16 non-null object
life expectancy at birth (female)         16 non-null object
life expectancy at birth (male)           16 non-null object
life expectancy at age 60 (both sexes)    16 non-null object
life expectancy at age 60 (female)        16 non-null object
life expectancy at age 60 (male)          16 non-null object
dtypes: object(8)
memory usage: 1.1+ KB


4) Plot life expectancy (from birth, "both sexes") as a function of year for Brazil using the dataframe you constructed in question 3.  

In [345]:
%matplotlib notebook
import matplotlib.pyplot as plt

brazil_expectancy.loc[:, 'year'] = brazil_expectancy.loc[:, 'year'].astype(int)
brazil_expectancy.loc[:, 'life expectancy at birth (both sexes)':] = brazil_expectancy.loc[:, 'life expectancy at birth (both sexes)':].astype(float)

brazil_expectancy.plot.scatter(x="year",
                               y="life expectancy at birth (both sexes)",
                               marker=".")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


<matplotlib.axes._subplots.AxesSubplot at 0x2070ab80630>
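
The warning above appears because `brazil_expectancy` was created by filtering `lifeexpectancy`, so pandas cannot tell whether the later assignment is meant to reach the original frame. One common way to avoid it, sketched here on toy data with made-up numbers (not WHO figures), is to take an explicit `.copy()` when building the subset:

```python
import pandas as pd

# Made-up values standing in for the real data; everything starts as strings
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Chile"],
    "year": ["2015", "2014", "2015"],
    "life expectancy at birth (both sexes)": ["75.0", "74.8", "80.5"],
})

# .copy() makes the subset an independent dataframe
brazil = toy[toy["country"] == "Brazil"].copy()

# These assignments now modify only the copy, never the original frame
brazil["year"] = brazil["year"].astype(int)
brazil["life expectancy at birth (both sexes)"] = (
    brazil["life expectancy at birth (both sexes)"].astype(float)
)

print(brazil.dtypes)
```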

5) Which country has the highest life expectancy (from birth) for men, women, and both sexes? What are the associated years for each of these life expectancies? (Be sure to show your code!)

Answer:
- The country with the highest life expectancy (from birth) for men is: Switzerland (81.3 years, in 2015)
- The country with the highest life expectancy (from birth) for women is: Japan (86.8 years, in 2015)
- The country with the highest life expectancy (from birth) for both sexes is: Japan (83.7 years, in 2015)

In [245]:
lifeexpectancy[lifeexpectancy['life expectancy at birth (male)'] == lifeexpectancy['life expectancy at birth (male)'].max()] # Highest for males

Unnamed: 0,country,year,life expectancy at birth (both sexes),life expectancy at birth (female),life expectancy at birth (male),life expectancy at age 60 (both sexes),life expectancy at age 60 (female),life expectancy at age 60 (male)
2523,Switzerland,2015,83.4,85.3,81.3,25.5,27,23.9


In [246]:
lifeexpectancy[lifeexpectancy['life expectancy at birth (female)'] == lifeexpectancy['life expectancy at birth (female)'].max()] # Highest for females

Unnamed: 0,country,year,life expectancy at birth (both sexes),life expectancy at birth (female),life expectancy at birth (male),life expectancy at age 60 (both sexes),life expectancy at age 60 (female),life expectancy at age 60 (male)
1316,Japan,2015,83.7,86.8,80.5,26.1,28.7,23.4


In [247]:
lifeexpectancy[lifeexpectancy['life expectancy at birth (both sexes)'] == lifeexpectancy['life expectancy at birth (both sexes)'].max()] # Highest for both sexes

Unnamed: 0,country,year,life expectancy at birth (both sexes),life expectancy at birth (female),life expectancy at birth (male),life expectancy at age 60 (both sexes),life expectancy at age 60 (female),life expectancy at age 60 (male)
1316,Japan,2015,83.7,86.8,80.5,26.1,28.7,23.4


6) Using life expectancy data for "both sexes" from birth, which country has the fastest growing life expectancy on average for all years provided? Likewise, which country has the slowest growing (or even fastest decreasing) life expectancy on average for all years provided? Using pandas, plot the life expectancy of these two countries as a function of year in the same graph.

Answer: The country with the fastest growing life expectancy is Eritrea, and the country with the slowest growing life expectancy is the Syrian Arab Republic. Both countries have somewhat unusual/outlier data points (Eritrea's life expectancy suddenly jumps and the Syrian Arab Republic's suddenly dips).

In [326]:
sorted_year = lifeexpectancy.sort_values(by='year', axis=0, ascending=True)
country_avg = dict()

# Mean year-over-year percentage change in life expectancy for each country.
# Grouping by the column name keeps each key a plain string
# (a one-item list would give tuple keys in pandas >= 2.0).
for country, df in sorted_year.groupby('country'):
    country_avg[country] = df['life expectancy at birth (both sexes)'].pct_change().mean()

max_c = max(country_avg, key=country_avg.get)  # fastest growing
min_c = min(country_avg, key=country_avg.get)  # slowest growing

In [330]:
fig = plt.figure()
ax1 = fig.add_subplot(111)

for country, df in lifeexpectancy.groupby('country'):
    if country == max_c:
        ax1.scatter(x=df['year'], y=df['life expectancy at birth (both sexes)'], marker=".", label=country, color='blue')

    if country == min_c:
        ax1.scatter(x=df['year'], y=df['life expectancy at birth (both sexes)'], marker=".", label=country, color='red')

plt.legend()


<matplotlib.legend.Legend at 0x2070bd26160>
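
The mean of `pct_change()` is sensitive to single-year jumps like the ones noted in the answer. As a cross-check, one could instead fit a straight line per country and compare slopes (years of life expectancy gained per calendar year). This is an alternative measure, not the one used above, and the countries "A" and "B" and their numbers are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "life expectancy at birth (both sexes)": [50.0, 51.0, 52.0, 53.0,
                                              70.0, 70.0, 69.0, 69.0],
})

def yearly_slope(group):
    """Least-squares slope of life expectancy against year."""
    # Measure years from the first observation to keep the fit well conditioned
    x = group["year"] - group["year"].min()
    y = group["life expectancy at birth (both sexes)"]
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

slopes = {country: yearly_slope(group) for country, group in toy.groupby("country")}

fastest = max(slopes, key=slopes.get)
slowest = min(slopes, key=slopes.get)
print(fastest, slowest)
```

If the two measures disagree about which country grows fastest, that is a hint that one or two outlier years are driving the percentage-change result.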

7) Pick 3 countries you'd like to compare, and plot their life expectancies (from birth, "both sexes") on the same graph.  

In [332]:
fig2 = plt.figure()
ax2 = fig2.add_subplot(111)

# Grouping by the column name keeps each key a plain string
# (a one-item list would give tuple keys in pandas >= 2.0).
for country, df in lifeexpectancy.groupby('country'):
    if country == "Norway":
        ax2.scatter(x=df['year'], y=df['life expectancy at birth (both sexes)'], marker=".", label=country, color='blue')

    if country == "Sweden":
        ax2.scatter(x=df['year'], y=df['life expectancy at birth (both sexes)'], marker=".", label=country, color='red')

    if country == "Switzerland":
        ax2.scatter(x=df['year'], y=df['life expectancy at birth (both sexes)'], marker=".", label=country, color='green')

plt.legend()


<matplotlib.legend.Legend at 0x2070b565358>
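
An alternative to looping over every group is to select the chosen countries with `isin()` and reshape to one column per country with `pivot()`. A minimal sketch; the numbers below are made up for illustration and `chosen` is a hypothetical name:

```python
import pandas as pd

# Made-up values; only the column layout matches the homework data
toy = pd.DataFrame({
    "country": ["Norway", "Norway", "Sweden", "Sweden", "Spain", "Spain"],
    "year": [2014, 2015] * 3,
    "life expectancy at birth (both sexes)": [81.0, 81.2, 82.0, 82.1, 82.5, 82.6],
})
chosen = ["Norway", "Sweden"]

# Keep only the chosen countries, then make one column per country indexed by year
wide = (toy[toy["country"].isin(chosen)]
        .pivot(index="year", columns="country",
               values="life expectancy at birth (both sexes)"))

print(wide)
```

Calling `wide.plot(marker=".")` on the result would draw one labelled line per country without any explicit colour handling.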

8) Plot the _average life expectancy_ for _all_ countries as a function of year.  

In [356]:
# I ended up plotting the average of "life expectancy at birth" across the male, female, and both-sexes columns for each country.
# Also, due to the number of countries, I wasn't able to fit the legend.

import numpy

fig3 = plt.figure()
ax3 = fig3.add_subplot(111)

lifeexpectancy['avg life expectancy at birth (all)'] = lifeexpectancy[['life expectancy at birth (both sexes)', 'life expectancy at birth (female)', 'life expectancy at birth (male)']].mean(axis=1)

for country, df in lifeexpectancy.groupby('country'):  # column name (not a list) keeps string keys
    ax3.scatter(x=df['year'], y=df['avg life expectancy at birth (all)'], marker=".", label=country, color=numpy.random.rand(3,))
    
# plt.legend(bbox_to_anchor=(1,1.25), loc='upper left', ncol=1)
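
Another reading of question 8 is a single curve: the mean of the "both sexes" column across all countries for each year. A minimal sketch of that aggregation on made-up numbers (the real column must be numeric for `mean()` to work):

```python
import pandas as pd

# Two hypothetical countries with invented values
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2014, 2015, 2014, 2015],
    "life expectancy at birth (both sexes)": [60.0, 61.0, 80.0, 81.0],
})

# One value per year: the unweighted mean over all countries
yearly_mean = toy.groupby("year")["life expectancy at birth (both sexes)"].mean()

print(yearly_mean)
```

This is an unweighted average, so every country counts equally regardless of population. `yearly_mean.plot(marker=".")` would then show the average as a function of year.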

### Exploring the provenance of this data. 

#### The data we've been using in questions 1 - 8 was obtained at http://apps.who.int/gho/data/node.main.688 via the World Health Organization website. You can find a discussion of how they tabulate life expectancy for each country in [this PDF](http://www.who.int/healthinfo/statistics/LT_method_1990_2012.pdf). Use this document and any additional information you feel is relevant to answer the following questions.

9) Take a look at the three countries you examined in question 7. What was the method (or methods) used to calculate life expectancy for the countries you examined? Are the life expectancies for these three countries equally reliable? Were these statistics based on empirical observation or were they approximated? If they were approximated, how was this done? Were the statistics smoothed? Be specific and explain/justify your reasoning. (350 words max) 

Answer: According to the WHO document, life expectancy for each country was calculated with the following method (years of available data and last year used in parentheses):

- Norway: Method A (1950-2012, 2012)
- Sweden: Method A (1950-2012, 2012)
- Switzerland: Method B (1950-2011, 2011)

The life expectancies for Norway and Sweden are approximately equally reliable, since both use the same method and have the same range of years for which data was available. Method A builds life tables from death rates in civil registration data, and the data was not smoothed. Switzerland was calculated using Method B: a projection of life table parameters from adjusted civil registration data, smoothed with a moving average. Norway's and Sweden's figures therefore rest on empirical observation ("raw" civil registration data), whereas Switzerland's figures were approximated and may not exactly reflect the raw data.
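
To get a feel for what "smoothed with a moving average" does to a series, here is a sketch using a centered 3-year window. The window length and the numbers are assumptions chosen for illustration; the WHO document should be consulted for the actual procedure.

```python
import pandas as pd

# Invented, deliberately jumpy series indexed by year
raw = pd.Series([80.0, 83.0, 80.0, 83.0, 80.0],
                index=[2007, 2008, 2009, 2010, 2011])

# Each smoothed value is the mean of the year itself and its two neighbours
smoothed = raw.rolling(window=3, center=True).mean()

print(smoothed)
```

The smoothed series swings between 81 and 82 instead of 80 and 83, and the first and last years have no value because their windows are incomplete. Smoothing therefore trades year-to-year detail for stability, which is one reason smoothed and unsmoothed series are not strictly comparable.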

10) Read sections 4 and 5 of the above-mentioned report. In general, do the life expectancies calculated by WHO differ from official estimates provided by the countries themselves? What are some of the general methods and/or sources used to calculate life expectancy? What are some sources of error? (200 words max)

Answer: From the report, we learn that death registration data as a whole suffers from recording bias, such as incomplete recording of deaths at older ages, misreported ages, and issues with population estimates for older age groups. The WHO calculations therefore apply estimation techniques (the Thatcher-Kannisto method) that attempt to retroactively "unbias" the data. For countries where registration data was available for 2012, WHO used that data; for countries such as Switzerland that did not provide vital registration data for 2012, WHO used life table parameters projected from the data for available years.

11) Based on your answers from questions 9 and 10, how much confidence do you have in the graph you made for question 7? Is the comparison of life expectancies between the countries you selected in question 7 meaningful? Why or why not? Be specific.  (300 words max)

Answer: Based on the answers to questions 9 and 10, I am more wary of the data in the graph I made for question 7. In particular, I am concerned about how much bias is introduced by misreporting and other sources of error in the death registration records. Since Norway and Sweden are both tabulated with the same method (Method A) and are countries with similar characteristics, I would expect the comparison between them to be reasonable. To be more meaningful, the entire graph should compare countries with the same tabulation method and data availability, i.e., as many sources of variation as possible should be held constant. I also think it is important to have a gauge of the error, i.e., how much potential bias may be included in the data and the WHO tabulations.