# Notebook for analyzing some data
In this notebook, we download some __synthetic__ public-use data on households in Denmark and do some analysis.

The data is from Eurostat; click [here](https://ec.europa.eu/eurostat/web/microdata/public-microdata) for more information.

In [None]:
%matplotlib inline
import numpy as np #Load a module which is good for everything to do with numbers
import pandas as pd #Load a module for working with data sets
import matplotlib.pyplot as plt #Load a module good for plotting

dataset = pd.read_csv('https://www.karlharmenberg.com/temp/DK_2013h_EUSILC.csv')
#I uploaded the data set to my web page in preparation for this class

Looking at the data set using pandas is easy. We just write the name of the data set and hit shift-enter.
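Besides typing the bare name, a few pandas attributes give a quicker overview of a large table. The sketch below uses a tiny stand-in table called `toy` (a hypothetical name, with invented values) so that it runs on its own; in the notebook the same calls would be applied to `dataset`.

```python
import pandas as pd

# Toy stand-in table: the column codes appear in this notebook, the values are invented.
toy = pd.DataFrame({
    "HB030": [1, 2, 3],
    "HY010": [250000.0, 410000.0, 180000.0],
})

n_rows, n_cols = toy.shape        # number of rows and number of columns
column_names = list(toy.columns)  # the variable codes
first_rows = toy.head(2)          # only the first two rows

print(n_rows, n_cols)
print(column_names)
print(first_rows)
```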

Hmm, easy but maybe not so enlightening. What does "HB030" mean?

After clicking around the Eurostat homepage, we come across the [manual](https://circabc.europa.eu/sd/a/d7e88330-3502-44fa-96ea-eab5579b4d1e/SILC065%20operation%202013%20VERSION%20MAY%202013.pdf).

Now, looking at the manual, what does "HB030" mean?

Can you find household gross income? Let's read the definition.

## The distribution of household gross income
Showing the distribution of household gross income is easy: a histogram of the income column takes only a few commands.
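A minimal sketch of such a histogram is given here. The incomes are randomly generated stand-ins (an assumption made so the block runs without the download), and the choice of 50 bins is arbitrary; with the real data one would pass the income column, for example `dataset['HY010']` as used at the end of this notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented incomes, drawn from a log-normal distribution purely for illustration.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=12.5, sigma=0.6, size=1000))

fig, ax = plt.subplots()
# hist returns the count in each bin and the bin edges (plus the drawn bars).
counts, bin_edges, _ = ax.hist(income, bins=50)
ax.set_xlabel("Household gross income")
ax.set_ylabel("Number of households")
```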

Let's compute the distribution of log income as well.
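One way to do this, sketched with invented numbers: the logarithm is undefined for zero or negative income, so this version drops those observations first. Whether dropping them is appropriate for the analysis is a judgment call, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented values, including one negative and one zero income.
income = pd.Series([-5000.0, 0.0, 120000.0, 250000.0, 400000.0, 1500000.0])

positive = income[income > 0]   # keep only strictly positive incomes
log_income = np.log(positive)   # natural logarithm

# Bin counts of log income; plt.hist(log_income) would draw the same bins.
counts, bin_edges = np.histogram(log_income, bins=4)

print(len(income) - len(positive), "observations dropped")
print(counts)
```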

How many households have negative gross income?
How many households earn more than 1,000,000?
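Questions like these can be answered by counting the rows where a comparison is true. The sketch below uses invented numbers; with the real data, `income` would be the income column of `dataset`.

```python
import pandas as pd

# Invented values for illustration only.
income = pd.Series([-5000.0, 0.0, 120000.0, 250000.0, 400000.0, 1500000.0])

# A comparison gives a column of True/False; summing it counts the True entries.
n_negative = int((income < 0).sum())
n_above_million = int((income > 1_000_000).sum())

# Taking the mean instead gives the share of households.
share_negative = (income < 0).mean()

print(n_negative, n_above_million)
```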

Now, let's draw the Lorenz curve. The way to do this is to:
* Compute total income
* Sort the households from lowest to highest income and compute the share of total income earned by the first $n$ households, for all $n$.

In Python, we define a _function_ that carries out these steps.

In [None]:
def lorenz_curve(list_of_incomes):
    total_income = np.sum(list_of_incomes)
    list_of_incomes_sorted = np.sort(list_of_incomes)
    cumulative_income = np.cumsum(list_of_incomes_sorted)
    return cumulative_income/total_income

Now, we apply this function to our data and plot the results.
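A sketch of what this could look like, using four invented incomes so the block runs on its own. The function is repeated from the cell above, and the dashed 45-degree line is added as a reference for perfect equality. With the real data, `incomes` would be the income column of `dataset`.

```python
import numpy as np
import matplotlib.pyplot as plt

def lorenz_curve(list_of_incomes):
    # Same steps as the function defined above, repeated so this block runs alone.
    total_income = np.sum(list_of_incomes)
    cumulative_income = np.cumsum(np.sort(list_of_incomes))
    return cumulative_income / total_income

incomes = np.array([40.0, 10.0, 20.0, 30.0])  # invented values
shares = lorenz_curve(incomes)                # cumulative income shares
# Cumulative population share matching each entry of shares.
population = np.arange(1, len(shares) + 1) / len(shares)

fig, ax = plt.subplots()
ax.plot(population, shares, label="Lorenz curve")
ax.plot([0, 1], [0, 1], linestyle="--", label="Perfect equality")
ax.set_xlabel("Cumulative share of households")
ax.set_ylabel("Cumulative share of income")
ax.legend()
```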

Finally, we compute the Gini coefficient and the coefficient of variation:

In [None]:
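def gini_coeff(x):
    # Requires all values in x to be zero or positive;
    # otherwise the result is not a valid Gini coefficient.
    x = np.asarray(x, dtype=float) # plain array, so ranks and values match by position
    n = len(x)
    s = x.sum()
    r = np.argsort(np.argsort(-x)) # zero-based ranks, the largest value gets rank 0
    return 1 - (2.0 * (r*x).sum() + s)/(n*s)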

print("Gini = ",gini_coeff(dataset['HY010']))
print("CoV = ", dataset['HY010'].std()/dataset['HY010'].mean())
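As a sanity check, the Gini coefficient can also be computed from the Lorenz curve. With $n$ households and Lorenz values $L_1, \dots, L_n$, the rank formula used in `gini_coeff` is algebraically equal to $(n+1)/n - (2/n)\sum_k L_k$. The sketch below compares the two on invented data; `gini_from_lorenz` is a hypothetical helper, not part of the original notebook, and both other functions are repeated so the block runs alone.

```python
import numpy as np

def lorenz_curve(list_of_incomes):
    # Repeated from the notebook: cumulative income shares, lowest income first.
    total_income = np.sum(list_of_incomes)
    return np.cumsum(np.sort(list_of_incomes)) / total_income

def gini_coeff(x):
    # Repeated from the notebook: rank-based formula, needs non-negative values.
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.sum()
    r = np.argsort(np.argsort(-x))  # zero-based ranks, largest value gets rank 0
    return 1 - (2.0 * (r * x).sum() + s) / (n * s)

def gini_from_lorenz(incomes):
    # Hypothetical helper: G = (n + 1)/n - (2/n) * (sum of the Lorenz values).
    shares = lorenz_curve(incomes)
    n = len(shares)
    return (n + 1) / n - 2.0 * shares.sum() / n

equal = np.array([1.0, 1.0, 1.0, 1.0])    # everyone earns the same
unequal = np.array([0.0, 0.0, 0.0, 1.0])  # one household earns everything

print(gini_coeff(equal), gini_from_lorenz(equal))
print(gini_coeff(unequal), gini_from_lorenz(unequal))
```

With equal incomes both versions give 0. With one household earning everything they give $1 - 1/n$, which is 0.75 for four households and approaches 1 as $n$ grows.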