# L2 Notes (NumPy & Pandas 1D) 

## Gapminder Data

1. Aged 15+ Employment Rate (%)
2. Life Expectancy (years)
3. GDP/capita (US$, inflation adjusted)
4. Primary school completion (% of boys)
5. Primary school completion (% of girls)

Obtained from gapminder.org

Questions:
1. Are employment rate and primary school completion correlated over time?
2. Were there any periods of reduced life expectancy in the last 30 years?
3. How has GDP changed with employment rates?
4. How has primary school completion differed between boys and girls?
5. Has an increase in GDP resulted in increased life expectancy?

In [4]:
import pandas as pd

In [5]:
daily_engagement = pd.read_csv('daily_engagement_full.csv')

In [6]:
len(daily_engagement['acct'].unique())

1237

In [7]:
import numpy as np

### Pandas vs NumPy
Series vs. Array
1. Series has more features
2. Array is simpler
3. Series is built on Array

Array ~ Python List

Similarities
1. Access elements by position a[0]
2. Access a range by slicing a[1:3] (upper bound not included)
3. Use for loops

Differences
1. All elements should have the same type
2. Includes convenient functions (mean, std)
3. Can be multi-dimensional (~ list of lists)
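
The similarities and differences above can be sketched as follows. The country names and employment values are made-up sample data, not taken from the Gapminder files.

```python
import numpy as np

# Hypothetical sample data, for illustration only
countries = np.array(['Algeria', 'Argentina', 'Armenia', 'Austria'])
employment = np.array([50.5, 58.4, 40.1, 57.1])

# Similarities to a Python list
first = countries[0]             # access by position
middle = countries[1:3]          # slice; upper bound not included
lengths = [len(c) for c in countries]  # iterate like a list

# Differences from a Python list
avg = employment.mean()          # built-in summary functions
spread = employment.std()
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])     # 2-D array, similar to a list of lists
```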

In [8]:
def max_employment(countries, employment):
    i = employment.argmax()
    return (countries[i], employment[i])

In [9]:
employment = pd.read_csv('employment_above_15.csv')

### Vectorized Operations in NumPy

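Vector = list of numbers

Arithmetic follows vector rules: an operation between two arrays is applied element by element, and an operation between an array and a single number is applied to every element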
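
A small sketch of vectorized operations, using made-up arrays `a` and `b`. Note that `*` multiplies element by element; it is not a dot product.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([1, 2, 1, 2])

total = a + b      # element-wise sum: [2, 4, 4, 6]
product = a * b    # element-wise product: [1, 4, 3, 8]
scaled = a * 2     # the scalar is applied to every element: [2, 4, 6, 8]
above = a > 2      # comparisons are vectorized too: [False, False, True, True]
```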



### Standardized Values = (Values - Mean) / STD
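
A minimal sketch of standardizing with vectorized operations. The helper name `standardize` and the sample values are assumptions for illustration. NumPy's `std()` defaults to the population standard deviation, while a Pandas series defaults to the sample version, so the two can give slightly different results.

```python
import numpy as np

def standardize(values):
    """Return how many standard deviations each value is from the mean."""
    return (values - values.mean()) / values.std()

# Hypothetical sample values with mean 5 and standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
standardized = standardize(values)
# standardized is [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```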

### Index Arrays

An array of booleans used to select elements from another array of the same length; it is often built by comparing an array against a value

a[a>2] -- keep all values in array a that are greater than 2
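
A short sketch of an index array in use. The names `mask` and `labels` are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

mask = a > 2       # boolean index array: [False, False, True, True, True]
kept = a[mask]     # same as a[a > 2]: [3, 4, 5]

# The same mask can select from any other array of the same length
labels = np.array(['p', 'q', 'r', 's', 't'])
selected = labels[mask]   # ['r', 's', 't']
```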



### + vs +=

"+=" updates the values in the original array --> operates in-place

"+" creates a new array with new values

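The difference shows up when two names refer to the same array. A minimal sketch with made-up arrays:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a          # b is another name for the same array
a += 1         # in-place update, so b sees it as well
# a and b are both [2, 3, 4]

c = np.array([1, 2, 3])
d = c          # d is another name for the same array
c = c + 1      # builds a new array and rebinds c; d is untouched
# c is [2, 3, 4], d is still [1, 2, 3]
```
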
In [10]:
def variable_correlation(variable1, variable2):
    # Boolean series: True where a value is above its own variable's mean
    variable1_above = variable1 > variable1.mean()
    variable2_above = variable2 > variable2.mean()
    
    # Cast to int so each sum is 2 (both above), 0 (both below), or 1 (mixed)
    directions = variable1_above.astype(int) + variable2_above.astype(int)
    num_same_direction = 0
    num_different_direction = 0
    
    for direction in directions:
        if direction == 2 or direction == 0:
            num_same_direction += 1
        else:
            num_different_direction += 1
    
    return (num_same_direction, num_different_direction)

In [11]:
variable1 = pd.Series([1,2,3,4])
variable2 = pd.Series([10,11,12,13])

In [12]:
variable_correlation(variable1,variable2)

(4, 0)

### Pandas Series

Cross between a list and a dictionary

Each value has an index label -- use series.loc['label'] to access values by label

Use series.iloc[0] to access values by position

Adding series with different indexes matches values by index label and results in NaN for any label that appears in only one series
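
A brief sketch of label access, position access, and index alignment. The life expectancy figures are made-up sample values, not Gapminder data.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
life_expectancy = pd.Series([74.7, 75.0, 83.4],
                            index=['Albania', 'Algeria', 'Andorra'])

by_label = life_expectancy.loc['Algeria']   # lookup by index label
by_position = life_expectancy.iloc[0]       # lookup by position

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not position:
# 'b' and 'c' are summed, 'a' and 'd' become NaN
total = s1 + s2
```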

In [13]:
def max_employment(employment):
    max_country = employment.idxmax()
    max_value = employment.loc[max_country]

    return (max_country, max_value)

### Filling Missing Values

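To avoid NaN in the result, either drop the missing values or fill them in before the math:

1. Use .dropna() on the result to drop the NaN entries

2. Use .add() with fill_value=0 to treat missing values as 0 before the math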
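
A minimal sketch of the first option, using two small made-up series whose indexes only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

total = s1 + s2            # NaN at 'a', 'b', 'e', 'f'
matched = total.dropna()   # keeps only the labels found in both series
# matched holds c 13.0 and d 24.0
```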


In [20]:
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

s1.add(s2, fill_value=0)

a     1.0
b     2.0
c    13.0
d    24.0
e    30.0
f    40.0
dtype: float64

### Apply() function

series.apply(function) takes a function and creates a new series by applying it to each element in the series

In [23]:
names = pd.Series([
    'Andre Agassi',
    'Barry Bonds',
    'Christopher Columbus',
    'Daniel Defoe',
    'Emilio Estevez',
    'Fred Flintstone',
    'Greta Garbo',
    'Humbert Humbert',
    'Ivan Ilych',
    'James Joyce',
    'Keira Knightley',
    'Lois Lane',
    'Mike Myers',
    'Nick Nolte',
    'Ozzy Osbourne',
    'Pablo Picasso',
    'Quirinus Quirrell',
    'Rachael Ray',
    'Susan Sarandon',
    'Tina Turner',
    'Ugueth Urbina',
    'Vince Vaughn',
    'Woodrow Wilson',
    'Yoji Yamada',
    'Zinedine Zidane'
])

In [33]:
def reverse_name(name):
    split_name = name.split(" ")
    first_name = split_name[0]
    last_name = split_name[1]
    return last_name + ", " + first_name

def reverse_names(names):
    return names.apply(reverse_name)

In [34]:
reverse_names(names)


0             Agassi, Andre
1              Bonds, Barry
2     Columbus, Christopher
3             Defoe, Daniel
4           Estevez, Emilio
5          Flintstone, Fred
6              Garbo, Greta
7          Humbert, Humbert
8               Ilych, Ivan
9              Joyce, James
10         Knightley, Keira
11               Lane, Lois
12              Myers, Mike
13              Nolte, Nick
14           Osbourne, Ozzy
15           Picasso, Pablo
16       Quirrell, Quirinus
17             Ray, Rachael
18          Sarandon, Susan
19             Turner, Tina
20           Urbina, Ugueth
21            Vaughn, Vince
22          Wilson, Woodrow
23             Yamada, Yoji
24         Zidane, Zinedine
dtype: object

### Plotting in Pandas

If the variable data is a NumPy array, a Pandas series, or a list, the code:
    
    import matplotlib.pyplot as plt
    plt.hist(data)

will create a histogram of the data. If data is a Pandas series, you can also use:

    data.hist()
    
   