# Intro
The data in this lesson was obtained from the site gapminder.org. The variables included are:
- Aged 15+ Employment Rate (%)
- Life Expectancy (years)
- GDP/capita (US$, inflation adjusted)
- Primary school completion (% of boys)
- Primary school completion (% of girls)

In this lesson we will not focus on the data analysis process. Instead we will do a bunch of exercises showing what NumPy and Pandas can do.

# Questions

How has the Polish employment rate changed compared to the old EU average?

What are the highest and the lowest employment levels?
- Which countries have them?
- Where is Poland on that spectrum?

How do these variables relate to each other?

Are there consistent trends across countries?

---

Which countries have the highest and the lowest life expectancy, and what are those values?

Which 10 countries have the highest GDP, and which 10 have the lowest?

How does GDP change over the years (world average)?

How does GDP change over the years (Europe average)?

How does primary school completion differ across Europe?

How does primary school completion differ between sexes in Europe, Africa and the Middle East?


How has the Polish employment rate changed over time?



## Compare Python code from Lesson 1 to Pandas and NumPy data loading

In [32]:
import unicodecsv


In [33]:
def read_csv(path):
    with open(path, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        data = list(reader)
    return data

daily_engagement = read_csv("daily_engagement.csv")

In [34]:
def get_unique_students(data):
    return {data_row['acct'] for data_row in data}

unique_engagement_students = get_unique_students(daily_engagement)
len(unique_engagement_students)

1237

### NumPy

In [15]:
import numpy as np

In [27]:
daily_engagement = np.loadtxt("daily_engagement.csv", dtype=str, delimiter=",", skiprows=1)

In [31]:
np.unique(daily_engagement[:, 0])

array(["b'0'", "b'1'", "b'10'", ..., "b'995'", "b'998'", "b'999'"], 
      dtype='<U16')

### Pandas

In [10]:
import pandas as pd

In [19]:
daily_engagement = pd.read_csv("daily_engagement.csv")

In [12]:
len(daily_engagement['acct'].unique())

1237

In [70]:
# time it!
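
One possible way to fill in the timing cell is sketched below. It is only an illustration: the account IDs are synthetic (generated with a fixed seed, not read from `daily_engagement.csv`), the helper names `unique_python`, `unique_numpy` and `unique_pandas` are made up for this example, and only the "count unique values" step is timed, not the file loading. The measured times depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the 'acct' column (an assumption, not the real data):
# 100,000 rows drawn from 1,000 possible account IDs.
rng = np.random.RandomState(0)
acct_array = rng.randint(0, 1000, size=100000)
acct_list = acct_array.tolist()
acct_series = pd.Series(acct_array)


def unique_python():
    # Plain Python: build a set from a list
    return len(set(acct_list))


def unique_numpy():
    # NumPy: np.unique returns the sorted unique values
    return len(np.unique(acct_array))


def unique_pandas():
    # Pandas: Series.unique returns the unique values in order of appearance
    return len(acct_series.unique())


# Time 20 calls of each approach and print the total in seconds
for func in (unique_python, unique_numpy, unique_pandas):
    seconds = timeit.timeit(func, number=20)
    print(func.__name__, seconds)
```

Inside a notebook, the `%timeit` magic is a shorter alternative to calling `timeit.timeit` by hand.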

## One-dimensional data structures

Both Pandas and NumPy have special data structures made to represent one-dimensional data: the array (NumPy) and the Series (Pandas).

NumPy arrays:
- are simpler

Pandas Series:
- have more features than NumPy arrays
- are built on top of NumPy arrays

Both NumPy and Pandas have data structures to represent two-dimensional data.
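
A minimal sketch of the relationship between the two one-dimensional structures, assuming three illustrative (rounded) employment values and country labels chosen only for this example:

```python
import numpy as np
import pandas as pd

# A plain NumPy array: just the values
values = np.array([55.7, 51.4, 50.5])

# A Pandas Series: the same values plus a label for each one
series = pd.Series(values, index=['Afghanistan', 'Albania', 'Algeria'])

# The Series exposes its data as a NumPy array
underlying = series.values
print(type(underlying))

# Extra feature of a Series: look up a value by label
print(series['Albania'])

# Both offer the same basic statistics
print(values.mean(), series.mean())
```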

## NumPy arrays and Python Lists

In many ways NumPy arrays are similar to Python lists:
- both contain a sequence of elements, and those elements can be of any type

Similarities:
- Access elements by position: data[0]
- Access a range of elements using slicing: data[4:10]
- Use for loops: for x in data

Differences:
- In a NumPy array, every element should have the same type (string, int, boolean, etc.)
- NumPy arrays provide convenient functions such as mean() and std()
- NumPy arrays can be multidimensional
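
The differences listed above can be seen in a short sketch. The variable names and values are made up for illustration.

```python
import numpy as np

# A Python list keeps each element's own type
mixed_list = [1, 2.5, 'three']
print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'str']

# A NumPy array converts all elements to one common type (here: strings)
mixed_array = np.array(mixed_list)
print(mixed_array.dtype.kind)  # 'U' means unicode string

# Mixing ints and floats gives a float array
int_and_float = np.array([1, 2.5])
print(int_and_float.dtype)  # float64

# Arrays can have more than one dimension, and mean() still works
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)   # (2, 3)
print(grid.mean())  # 3.5
```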

In [35]:
# First 20 countries with employment data
countries = np.array([
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
])

# Employment data in 2007 for those 20 countries
employment = np.array([
    ## nice formatting!
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
])

In [38]:
# Accessing elements
print(countries[0])
print(countries[3])

Afghanistan
Angola


In [39]:
# Slicing
print(countries[0:3])
print(countries[:3])
print(countries[17:])
print(countries[:])

['Afghanistan' 'Albania' 'Algeria']
['Afghanistan' 'Albania' 'Algeria']
['Bhutan' 'Bolivia' 'Bosnia and Herzegovina']
['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Argentina' 'Armenia'
 'Australia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh'
 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina']


In [41]:
# Element types
print(countries.dtype)
print(employment.dtype)
print(np.array([0, 1, 2, 3]).dtype)
print(np.array([1.0, 1.5, 2.0, 2.5]).dtype)
print(np.array(['AL', 'AK', 'AZ', 'AR', 'CA']).dtype)

<U22
float64
int64
float64
<U2


In [42]:
# Looping. 
for country in countries:
    print(country)

Afghanistan
Albania
Algeria
Angola
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Bosnia and Herzegovina


In [43]:
print(employment.mean())
print(employment.std())
print(employment.max())
print(employment.sum())

58.6850000385
9.33826911369
75.69999695
1173.70000077


In [46]:
def max_employment(countries, employment):
    '''
    Return the name of the country with the highest employment
    in the given employment data, and the employment in that country.
    '''
    i = employment.argmax()      # position of the highest employment value
    max_country = countries[i]   # the arrays are in the same order
    max_value = employment[i]

    return (max_country, max_value)

max_employment(countries, employment)

('Angola', 75.699996949999999)

## Vectorized Operations

### Addition
A vector is a list of numbers.

Adding two NumPy arrays with + behaves like vector addition in linear algebra: the elements are added pairwise.

For Python lists, + implements list concatenation instead.

In [47]:
# Python list
[1, 2, 3] + [4, 5, 6]

[1, 2, 3, 4, 5, 6]

In [48]:
# NumPy array
np.array([1, 2, 3]) + np.array([4, 5, 6])

array([5, 7, 9])

### Multiplication

In [50]:
# Python list
[1, 2, 3] * 3

[1, 2, 3, 1, 2, 3, 1, 2, 3]

In [51]:
# NumPy array
np.array([1, 2, 3]) * 3

array([3, 6, 9])

### More vectorized operations

Math operations:
- Add: +
- Subtract: -
- Multiply: *
- Divide: /
- Exponentiate: **

Logical operations (the arrays should be boolean!):
- And: &
- Or: |
- Not: ~

Comparison operations:
- Greater: >
- Greater or equal: >=
- Less: <
- Less or equal: <=
- Equal: ==
- Not equal: !=

In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, if you wanted to see if matching terms in two integer vectors were non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.

Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.
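
The difference between bitwise and logical operations on integer arrays can be sketched as follows. The arrays `a` and `b` are made-up values, chosen so that the two kinds of "and" disagree.

```python
import numpy as np

a = np.array([1, 2, 0])
b = np.array([2, 1, 0])

# Bitwise and: 1 & 2 is 0 because the binary digits 01 and 10 share no bits
bitwise = a & b
print(bitwise)  # [0 0 0]

# Logical and: True wherever both elements are non-zero
logical = np.logical_and(a, b)
print(logical)  # [ True  True False]

# Converting to boolean arrays first gives the same result as logical_and
via_bool = (a != 0) & (b != 0)
print(via_bool)  # [ True  True False]

# The same distinction applies to "not"
print(~a)                 # [-2 -3 -1]
print(np.logical_not(a))  # [False False  True]
```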

In [55]:
# Arithmetic operations between 2 NumPy arrays.

a = np.array([1, 2, 3, 4])
b = np.array([1, 2, 1, 2])

print(a + b)
print(a - b)
print(a * b)
print(a / b)
print(a**b)

[2 4 4 6]
[0 0 2 2]
[1 4 3 8]
[ 1.  1.  3.  2.]
[ 1  4  3 16]


In [56]:
# Arithmetic operations between a NumPy array and a single number.
a = np.array([1, 2, 3, 4])
b = 2

print(a + b)
print(a - b)
print(a * b)
print(a / b)
print(a**b)

[3 4 5 6]
[-1  0  1  2]
[2 4 6 8]
[ 0.5  1.   1.5  2. ]
[ 1  4  9 16]


In [63]:
# Logical operations with NumPy arrays.
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

print(a & b)
print(a | b)
print(~a)

print(a & True)
print(a & False)

print(a | True)
print(a | False)

[ True False False False]
[ True  True  True False]
[False False  True  True]
[ True  True False False]
[False False False False]
[ True  True  True  True]
[ True  True False False]


In [64]:
# Comparison operations between 2 NumPy Arrays.
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

print(a > b)
print(a >= b)
print(a < b)
print(a <= b)
print(a == b)
print(a != b)

[False False False  True  True]
[False False  True  True  True]
[ True  True False False False]
[ True  True  True False False]
[False False  True False False]
[ True  True False  True  True]


In [66]:
# Comparison operations between a NumPy array and a single number
a = np.array([1, 2, 3, 4])
b = 2

print(a > b)
print(a >= b)
print(a < b)
print(a <= b)
print(a == b)
print(a != b)

[False False  True  True]
[False  True  True  True]
[ True False False False]
[ True  True False False]
[False  True False False]
[ True False  True  True]


In [69]:
# First 20 countries with school completion data
countries = np.array([
       'Algeria', 'Argentina', 'Armenia', 'Aruba', 'Austria','Azerbaijan',
       'Bahamas', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bolivia',
       'Botswana', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cambodia', 'Cameroon', 'Cape Verde'
])

# Female school completion rate in 2007 for those 20 countries
female_completion = np.array([
    97.35583,  104.62379,  103.02998,   95.14321,  103.69019,
    98.49185,  100.88828,   95.43974,   92.11484,   91.54804,
    95.98029,   98.22902,   96.12179,  119.28105,   97.84627,
    29.07386,   38.41644,   90.70509,   51.7478 ,   95.45072
])

# Male school completion rate in 2007 for those 20 countries
male_completion = np.array([
     95.47622,  100.66476,   99.7926 ,   91.48936,  103.22096,
     97.80458,  103.81398,   88.11736,   93.55611,   87.76347,
    102.45714,   98.73953,   92.22388,  115.3892 ,   98.70502,
     37.00692,   45.39401,   91.22084,   62.42028,   90.66958
])

def overall_completion_rate(female_completion, male_completion):
    '''
    Return a NumPy array containing the overall school completion rate
    for each country, computed as the simple average of the female and
    male rates. The arguments are NumPy arrays giving the female and
    male completion of each country in the same order.
    '''
    return (female_completion + male_completion) / 2.

overall_completion_rate(female_completion, male_completion)

array([  96.416025,  102.644275,  101.41129 ,   93.316285,  103.455575,
         98.148215,  102.35113 ,   91.77855 ,   92.835475,   89.655755,
         99.218715,   98.484275,   94.172835,  117.335125,   98.275645,
         33.04039 ,   41.905225,   90.962965,   57.08404 ,   93.06015 ])

## Standardizing Data

How does one data point compare to the rest?
For example, how does employment in the U.S. compare to employment in other countries?

To answer, convert each data point to the number of standard deviations it lies away from the mean.

In 2007:
- mean employment rate: 58.6%
- standard deviation: 10.5%

- United States: 62.3%
- Difference from the mean employment rate: 3.7 percentage points, or 0.35 sd

- Mexico: 57.9%
- Difference from the mean employment rate: -0.7 percentage points, or -0.067 sd



In [None]:
# do it!
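
One possible way to fill in this exercise is sketched below. The function name `standardize` and the toy data are assumptions made for illustration. The toy values are chosen so that the mean is 5 and the standard deviation is 2, which makes the result easy to check by hand.

```python
import numpy as np


def standardize(values):
    """Convert each value to its distance from the mean, in standard deviations."""
    # Note: std() on a NumPy array defaults to the population
    # standard deviation (ddof=0).
    return (values - values.mean()) / values.std()


# Toy data with mean 5 and standard deviation 2
toy = np.array([2., 4., 4., 4., 5., 5., 7., 9.])
z_scores = standardize(toy)
print(z_scores)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]

# The United States figure from the 2007 numbers: (62.3 - 58.6) / 10.5
us_z = (62.3 - 58.6) / 10.5
print(round(us_z, 2))  # 0.35
```

Because the subtraction and division are vectorized, the same function works unchanged on the full employment array.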

## NumPy Index Arrays

In [81]:
data = np.array([1, 2, 3, 4, 5])
index = np.array([False, False, True, True, True])
data[index]

array([3, 4, 5])

In [82]:
index = data > 2
data[index]

array([3, 4, 5])

In [83]:
# more pythonic
data[data > 2]

array([3, 4, 5])
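
An index array built from one array can also select elements of another array of the same length. The sketch below reuses the first five countries and employment values from earlier in the lesson. The thresholds 55, 51 and 60 are arbitrary choices for illustration.

```python
import numpy as np

countries = np.array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina'])
employment = np.array([55.70000076, 51.40000153, 50.5, 75.69999695, 58.40000153])

# Boolean index array: True where employment is above 55
above_55 = employment > 55
print(countries[above_55])  # ['Afghanistan' 'Angola' 'Argentina']

# Conditions can be combined with & and |; the parentheses are required
# because & binds more tightly than the comparison operators
mid_range = (employment > 51) & (employment < 60)
print(countries[mid_range])  # ['Afghanistan' 'Albania' 'Argentina']
```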