# Lesson 2. NumPy and Pandas for 1D Data

## 1. Introduction

## 2. Gapminder Data

The data in this lesson was obtained from the site gapminder.org. The variables included are:

- Aged 15+ Employment Rate (%)
- Life Expectancy (years)
- GDP/capita (US$, inflation adjusted)
- Primary school completion (% of boys)
- Primary school completion (% of girls)

You can also obtain the data to analyze on your own from the Downloadables section.

## 3. 1D Data in NumPy and Pandas

- NumPy & Pandas
     - Fast data input and output
     - A wide range of built-in functions makes analysis simple
     - Well suited to data analysis
     
     
## 4. NumPy Arrays

**One-Dimensional Data Structures**

| -                      | Pandas           | NumPy (Numerical Python) |
| ---------------------- | ---------------- | ------------------------ |
| Data Structure (one-d) | Series           | Array                    |
|                        | -> More Features | -> Simpler               |

- Pandas Series are built on top of NumPy arrays
- NumPy arrays ≈ Python lists
     - A sequence of elements, and the elements can be anything

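To make the "built on top of" relationship concrete, here is a minimal sketch. The three values and country labels are illustrative, borrowed from the employment data used later in this lesson, and the variable names are assumptions rather than part of the original notes.

```python
import numpy as np
import pandas as pd

# A small Series with country names as its index (illustrative values)
employment = pd.Series(
    [55.7, 51.4, 50.5],
    index=['Afghanistan', 'Albania', 'Algeria'],
)

# The underlying data can be pulled out as a plain NumPy array
values = employment.to_numpy()
print(type(values))

# The Series adds features such as label-based access ...
print(employment['Albania'])

# ... while numeric functions give the same answer on both
print(employment.mean(), values.mean())
```

The Series keeps everything the array offers and layers an index on top, which matches the "More Features" versus "Simpler" split in the table above.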

**NumPy Arrays vs. Python Lists**

| Similarities                           | Differences                                  |
| -------------------------------------- | -------------------------------------------- |
| - Access elements by position (idx)    | - Each element should have same type (Array) |
| - Access a range of elements (slicing) | - Convenient functions (mean(), std()...)    |
| - Use loops (for x in a)               | - Can be multi-dimensional                   |
     
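The differences in the table can be seen in a few lines. This is a small sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np

# A Python list may mix types freely
mixed_list = [1, 2.5, 3]

# Converting to an array gives every element one common type:
# the integers are upcast to floats here
as_array = np.array(mixed_list)
print(as_array.dtype)   # float64

# Convenience functions are available directly on the array
print(as_array.mean())
print(as_array.std())

# Arrays can also be multi-dimensional
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(grid.shape)       # (2, 3)
```

Because all elements share one type, NumPy can store them compactly and apply functions such as `mean()` without checking each element.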

In [1]:
import pandas 

In [16]:
import numpy as np

# First 20 countries with employment data
countries = np.array([
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
])

# Employment data in 2007 for those 20 countries
employment = np.array([
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
])

# Change False to True for each block of code to see what it does

# Accessing elements
if False:
    print(countries[0])
    print(countries[3])

# Slicing
if False:
    print(countries[0:3])
    print(countries[:3])
    print(countries[17:])
    print(countries[:])

# Element types
if False:
    print(countries.dtype)
    print(employment.dtype)
    print(np.array([0, 1, 2, 3]).dtype)
    print(np.array([1.0, 1.5, 2.0, 2.5]).dtype)
    print(np.array([True, False, True]).dtype)
    print(np.array(['AL', 'AK', 'AZ', 'AR', 'CA']).dtype)

# Looping
if False:
    for country in countries:
        print('Examining country {}'.format(country))

    for i in range(len(countries)):
        country = countries[i]
        country_employment = employment[i]
        print('Country {} has employment {}'.format(country,
                country_employment))

# Numpy functions
if False:
    print(employment.mean())
    print(employment.std())
    print(employment.max())
    print(employment.sum())

def max_employment(countries, employment):
    '''
    Return the name of the country with the highest employment
    in the given employment data, and the employment in that
    country. Both arguments are arrays in the same country order.
    '''
    
    max_country = ''
    max_value = float('-inf')  # start below any possible value
    
    for i in range(len(countries)):
        country = countries[i]
        country_employment = employment[i]
        
        if country_employment > max_value:
            max_country = country
            max_value = country_employment

    return (max_country, max_value)

print(max_employment(countries, employment))


# Easier way: argmax() returns the position of the largest value

def max_employment2(countries, employment):
    i = employment.argmax()
    return countries[i], employment[i]
    
print(max_employment2(countries, employment))

('Angola', 75.69999695)
('Angola', 75.69999695)


## 5. Vectorized Operations


**Another Benefit of NumPy**
- A vector is a list of numbers
- Vector addition
    - vector1 = [1, 2, 3]
    - vector2 = [4, 5, 6]
    - vector1 + vector2 = [5, 7, 9]
    
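The vector addition above only works element by element when the vectors are NumPy arrays. With plain Python lists, `+` concatenates instead. A minimal sketch of the contrast, with illustrative variable names:

```python
import numpy as np

list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Python lists: + joins the two lists end to end
list_sum = list1 + list2
print(list_sum)      # [1, 2, 3, 4, 5, 6]

# NumPy arrays: + adds matching elements (vector addition)
vector_sum = np.array(list1) + np.array(list2)
print(vector_sum)    # [5 7 9]
```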
## 6. Multiplying by a Scalar

- Multiplying by a Scalar
    - [1, 2, 3] * 3 = [3, 6, 9]

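As with addition, the result above assumes a NumPy array. Multiplying a Python list by an integer repeats the list rather than scaling it. A short sketch, with illustrative names:

```python
import numpy as np

# Python list: * repeats the list three times
list_repeat = [1, 2, 3] * 3
print(list_repeat)   # [1, 2, 3, 1, 2, 3, 1, 2, 3]

# NumPy array: * multiplies every element by the scalar
scaled = np.array([1, 2, 3]) * 3
print(scaled)        # [3 6 9]
```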

## 7. Calculate Overall Completion Rate

**More Vectorized Operations**

| Math Operations   | Logical Operations                      | Comparison Operations |
| ----------------- | --------------------------------------- | --------------------- |
| Add : +           | And : &                                 | Greater : >           |
| Subtract : -      | Or : \|                                 | Greater or equal : >= |
| Multiply : *      | Not : ~                                 | Less : <              |
| Divide : /        | Make sure your arrays contain booleans! | Less or equal : <=    |
| Exponentiate : ** |                                         | Equal : ==            |
|                   |                                         | Not equal : !=        |

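The warning about booleans in the table matters because `&`, `|` and `~` act bit by bit on integer arrays. The sketch below, using made-up values, contrasts that with the logical result obtained from boolean arrays.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 2, 2])

# On integers, & is a bitwise AND: 1 & 2 = 0, 2 & 2 = 2, 3 & 2 = 2
bitwise = a & b
print(bitwise)    # [0 2 2]

# Comparisons produce boolean arrays, so & then behaves as logical AND.
# The parentheses are needed because & binds more tightly than >.
logical = (a > 1) & (b > 1)
print(logical)    # [False  True  True]
```

If the arrays are not already boolean, converting them first (for example with a comparison) avoids the surprising bitwise result.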
In [3]:
import numpy as np

# Change False to True for each block of code to see what it does

# Arithmetic operations between 2 NumPy arrays
if False:
    a = np.array([1, 2, 3, 4])
    b = np.array([1, 2, 1, 2])
    
    print(a + b)
    print(a - b)
    print(a * b)
    print(a / b)
    print(a ** b)
    
# Arithmetic operations between a NumPy array and a single number
if False:
    a = np.array([1, 2, 3, 4])
    b = 2
    
    print(a + b)
    print(a - b)
    print(a * b)
    print(a / b)
    print(a ** b)
    
# Logical operations with NumPy arrays
if False:
    a = np.array([True, True, False, False])
    b = np.array([True, False, True, False])
    
    print(a & b)
    print(a | b)
    print(~a)
    
    print(a & True)
    print(a & False)
    
    print(a | True)
    print(a | False)
    
# Comparison operations between 2 NumPy Arrays
if False:
    a = np.array([1, 2, 3, 4, 5])
    b = np.array([5, 4, 3, 2, 1])
    
    print(a > b)
    print(a >= b)
    print(a < b)
    print(a <= b)
    print(a == b)
    print(a != b)
    
# Comparison operations between a NumPy array and a single number
if False:
    a = np.array([1, 2, 3, 4])
    b = 2
    
    print(a > b)
    print(a >= b)
    print(a < b)
    print(a <= b)
    print(a == b)
    print(a != b)
    
# First 20 countries with school completion data
countries = np.array([
       'Algeria', 'Argentina', 'Armenia', 'Aruba', 'Austria','Azerbaijan',
       'Bahamas', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bolivia',
       'Botswana', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi',
       'Cambodia', 'Cameroon', 'Cape Verde'
])

# Female school completion rate in 2007 for those 20 countries
female_completion = np.array([
    97.35583,  104.62379,  103.02998,   95.14321,  103.69019,
    98.49185,  100.88828,   95.43974,   92.11484,   91.54804,
    95.98029,   98.22902,   96.12179,  119.28105,   97.84627,
    29.07386,   38.41644,   90.70509,   51.7478 ,   95.45072
])

# Male school completion rate in 2007 for those 20 countries
male_completion = np.array([
     95.47622,  100.66476,   99.7926 ,   91.48936,  103.22096,
     97.80458,  103.81398,   88.11736,   93.55611,   87.76347,
    102.45714,   98.73953,   92.22388,  115.3892 ,   98.70502,
     37.00692,   45.39401,   91.22084,   62.42028,   90.66958
])

def overall_completion_rate(female_completion, male_completion):
    '''
    Return a NumPy array containing the overall school completion rate
    for each country. The arguments are NumPy arrays giving the female
    and male completion of each country in the same order.
    '''
    # Simple average of the two rates, computed element by element
    overall_completion = female_completion + male_completion
    return overall_completion / 2

print(overall_completion_rate(female_completion, male_completion))

[ 96.416025 102.644275 101.41129   93.316285 103.455575  98.148215
 102.35113   91.77855   92.835475  89.655755  99.218715  98.484275
  94.172835 117.335125  98.275645  33.04039   41.905225  90.962965
  57.08404   93.06015 ]


## 8. Standardizing Data

> How does one data point compare to the rest?    
To answer, convert each data point to the number of standard deviations it lies away from the mean.
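
The conversion described above can be written with vectorized operations. This is a minimal sketch: the function name `standardize` and the sample values are illustrative assumptions, and it relies on NumPy's `std()` defaulting to the population standard deviation (`ddof=0`).

```python
import numpy as np

def standardize(values):
    """Return each value as a number of standard deviations from the mean."""
    # Subtracting a scalar and dividing by a scalar both apply element-wise
    return (values - values.mean()) / values.std()

# Illustrative sample, not taken from the Gapminder data
sample = np.array([2.0, 4.0, 6.0])
standardized = standardize(sample)
print(standardized)   # approximately [-1.22  0.    1.22]
```

A positive result means the data point is above the mean, a negative one means it is below, and the standardized values themselves have mean 0 and standard deviation 1.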