## NumPy and Pandas for 1D Data
### One-Dimensional Data in NumPy and Pandas
Loading a CSV file with pandas is much faster than parsing it row by row in plain Python. The example below uses a larger file, daily_engagement_full.csv.

In [1]:
import pandas as pd
daily_engagement = pd.read_csv('daily_engagement_full.csv')
len(daily_engagement['acct'].unique())

1237
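The same pattern can be tried without the full file. The sketch below reads a tiny in-memory CSV instead; the column name total_minutes_visited and the three rows are made-up assumptions for illustration, not data from daily_engagement_full.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real engagement file
csv_text = "acct,total_minutes_visited\n0,11.6\n0,37.2\n1,0.0\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Two equivalent ways to count distinct accounts
n_unique = len(df['acct'].unique())
n_unique_alt = df['acct'].nunique()
print(n_unique)  # 2
```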

### NumPy Arrays
NumPy and Pandas each have a special data structure for representing 1D data.
The NumPy structure is called an array, and the Pandas structure is called a Series (built on top of the NumPy array).
A Pandas Series has more features, while a NumPy array is simpler.
A NumPy array has both similarities to and differences from a Python list.

For example, take a NumPy array a of US state codes: 'AL', 'AK', 'AZ', ...

**Similarities**
* Access elements by position
    * a[0] --> 'AL'
* Access a range of elements
    * a[1:3] --> 'AK', 'AZ'
* Use loops
    * for x in a:

**Differences**
* Every element should have the same data type
    * string, int, boolean, etc.
* Convenient functions
    * mean(), std(), etc.
* Can be multi-dimensional
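
The points in both lists can be checked with a short sketch. The array contents below are small samples chosen for illustration (the state codes follow the example above; the numeric arrays are assumptions).

```python
import numpy as np

# Sample array of state codes, as in the example above
a = np.array(['AL', 'AK', 'AZ', 'AR', 'CA'])

first = a[0]                    # access by position
middle = a[1:3]                 # access a range (end index is excluded)
lengths = [len(x) for x in a]   # looping works as it does for a list

# Difference: the whole array shares one data type
numbers = np.array([1, 2, 3, 4])
mixed = np.array([1, 2.5, 3])   # the ints are upcast so every element is a float

# Difference: convenient functions are built in
average = numbers.mean()        # 2.5
spread = numbers.std()

# Difference: arrays can have more than one dimension
grid = np.array([[1, 2], [3, 4]])
print(grid.shape)               # (2, 2)
```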

### Vectorized Operations

In [3]:
import numpy as np

# Vector addition: elements are added pairwise
a = np.array([1,2,3])
b = np.array([4,5,6])
print(a+b)

# Multiplication by a scalar: every element is multiplied by 3
print(a*3)

[5 7 9]
[3 6 9]


The same elementwise behavior applies to many other operators.

* Math Operations
    * Add: +
    * Subtract: -
    * Multiply: *
    * Divide: /
    * Exponentiate: **

* Logical Operations (make sure the arrays contain booleans; on integer arrays these operators work bitwise)
    * And: &
    * Or: |
    * Not: ~

* Comparison Operations
    * Greater: >
    * Greater or equal: >=
    * Less: <
    * Less or equal: <=
    * Equal: ==
    * Not equal: !=    
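
A brief sketch of the comparison and logical operators listed above, using two small arrays invented for this example. The parentheses around each comparison matter because & and | bind more tightly than the comparison operators.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])

greater = a > b                  # [False, False, True, True]
in_range = (a >= 2) & (a <= 3)   # [False, True, True, False]
either = (a == 1) | (b == 1)     # [True, False, False, True]
neither = ~either                # [False, True, True, False]
squares = a ** 2                 # [1, 4, 9, 16]

# On integer arrays, & works bit by bit rather than as a logical "and"
bitwise = np.array([1, 2, 3]) & np.array([2, 2, 2])   # [0, 2, 2]
```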

### NumPy Index Arrays

In [4]:
'''
Fill in the function to calculate the mean time spent in the classroom
for students who stayed enrolled at least (greater than or equal to) 7 days.
Assume that days_to_cancel will contain only integers (there are no students who have not canceled yet).
    
The arguments are NumPy arrays. time_spent contains the amount of time spent
in the classroom for each student, and days_to_cancel contains the number
of days until each student cancels. The data is given in the same order
in both arrays.
'''

def mean_time_for_paid_students(time_spent, days_to_cancel):
    return time_spent[days_to_cancel >= 7].mean()

# Time spent in the classroom in the first week for 20 students
time_spent = np.array([
       12.89697233,    0.        ,   64.55043217,    0.        ,
       24.2315615 ,   39.991625  ,    0.        ,    0.        ,
      147.20683783,    0.        ,    0.        ,    0.        ,
       45.18261617,  157.60454283,  133.2434615 ,   52.85000767,
        0.        ,   54.9204785 ,   26.78142417,    0.
])

# Days to cancel for 20 students
days_to_cancel = np.array([
      4,   5,  37,   3,  12,   4,  35,  38,   5,  37,   3,   3,  68,
     38,  98,   2, 249,   2, 127,  35
])

mean_time_for_paid_students(time_spent, days_to_cancel)

41.054003485454537
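
The one-line solution above combines two steps: a comparison that builds a boolean array (the index array), and indexing with that array. A minimal sketch that separates the steps, using a made-up four-student sample rather than the data above (sample_time and sample_days are hypothetical names):

```python
import numpy as np

# Hypothetical four-student sample
sample_time = np.array([10.0, 0.0, 30.0, 30.0])
sample_days = np.array([3, 10, 7, 40])

# Step 1: the comparison is vectorized and yields one boolean per student
mask = sample_days >= 7            # [False, True, True, True]

# Step 2: indexing with the boolean array keeps only the True positions
selected = sample_time[mask]       # [0., 30., 30.]

mean_selected = selected.mean()    # 20.0
```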

### In-Place vs. Not In-Place
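a += b modifies the existing array in place, while a = a + b creates a new array and rebinds the name a to it. Any other name that referred to the original array sees the change in the first case but not in the second.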

In [5]:
# Examples of in-place vs. not in-place
a = np.array([1,2,3,4])
b = a                          # b refers to the same array as a
a += np.array([1,1,1,1])       # modifies that array in place, so b changes too
print("In-Place: {}".format(b))

a = np.array([1,2,3,4])
b = a
a = a + np.array([1,1,1,1])    # builds a new array and rebinds a; b still refers to the old one
print("Not In-Place: {}".format(b))


a = np.array([1,2,3,4])
a_slice = a[:3]   # Slicing does not copy the data; it returns a view onto a
a_slice[0] = 100  # Writing through the view changes the original array as well
print(a)

In-Place: [2 3 4 5]
Not In-Place: [1 2 3 4]
[100   2   3   4]
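
When a slice must not affect the original array, one option is to copy it explicitly. The sketch below contrasts a view with a copy; np.shares_memory is used only to make the difference visible, and the variable names are chosen for this example.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

view = a[:3]                  # a view: shares a's underlying data
independent = a[:3].copy()    # an explicit copy: has its own data

view[0] = 100                 # also changes a[0]
independent[1] = -1           # leaves a untouched

print(a)                                 # [100   2   3   4]
print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False
```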


### Pandas Series
A Series is similar to a NumPy array, but with extra functionality such as s.describe().

**Similarities to Array**
* Accessing elements
* Looping
* Convenient functions
* Vectorized operations
* Implemented in C (fast!)
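
As a small illustration of the extra functionality and the shared array behavior, the sketch below builds a Series from four made-up numbers and calls describe() on it.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# describe() returns a Series of summary statistics
# (count, mean, std, min, quartiles, max)
summary = s.describe()
print(summary['mean'])    # 2.5

# Array-like behavior carries over
first = s[0]              # access an element
doubled = s * 2           # vectorized operation
total = s.sum()           # convenient function
```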

In [6]:
'''
Fill in the function, variable_correlation, to calculate the number of data points for which
the directions of variable1 and variable2 relative to the mean are the same, 
and the number of data points for which they are different.
Direction here means whether each value is above or below its mean.
    
You can classify cases where the value is equal to the mean for one or
both variables however you like.
    
Each argument will be a Pandas series.

'''
countries = ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
             'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan',
             'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
             'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia']

life_expectancy_values = [74.7,  75. ,  83.4,  57.6,  74.6,  75.4,  72.3,  81.5,  80.2,
                          70.3,  72.1,  76.4,  68.1,  75.2,  69.8,  79.4,  70.8,  62.7,
                          67.3,  70.6]

gdp_values = [ 1681.61390973,   2155.48523109,  21495.80508273,    562.98768478,
              13495.1274663 ,   9388.68852258,   1424.19056199,  24765.54890176,
              27036.48733192,   1945.63754911,  21721.61840978,  13373.21993972,
                483.97086804,   9783.98417323,   2253.46411147,  25034.66692293,
               3680.91642923,    366.04496652,   1175.92638695,   1132.21387981]

life_expectancy = pd.Series(life_expectancy_values)
gdp = pd.Series(gdp_values)

def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & (variable2 < variable2.mean())
    same_direction = both_above|both_below
    num_same_direction = same_direction.sum()
    num_diff_direction = len(variable1) - num_same_direction
    
    return (num_same_direction, num_diff_direction)


variable_correlation(life_expectancy,gdp)    

(17, 3)

### Series Indexes
A Pandas Series is like a cross between a list and a dictionary: its elements can be accessed by position (with iloc) or by index label (with loc).

In [7]:
life_expectancy = pd.Series(life_expectancy_values, index = countries )

print("Life expectancy of the country at position 0: {}".format(life_expectancy.iloc[0]))
print("Life expectancy of the country with index label 'Angola': {}".format(life_expectancy.loc['Angola']))

Life expectancy of the country at position 0: 74.7
Life expectancy of the country with index label 'Angola': 57.6
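
The list-and-dictionary analogy can be made concrete with a three-country subset of the life expectancy data (the variable names below are chosen for this sketch):

```python
import pandas as pd

subset = pd.Series([74.7, 75.0, 83.4],
                   index=['Albania', 'Algeria', 'Andorra'])

by_position = subset.iloc[1]       # list-like: second element
by_label = subset.loc['Algeria']   # dict-like: look up by key
print(by_position == by_label)     # True, both refer to the same element

labels = list(subset.index)        # the keys, in order
as_dict = subset.to_dict()         # label -> value mapping
```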


In [8]:
'''
Fill in the function to return the name of the country
with the highest employment in the given employment
data, and the employment in that country.

'''

countries = [
    'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
    'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
    'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
    'Belize', 'Benin', 'Bhutan', 'Bolivia',
    'Bosnia and Herzegovina'
]


employment_values = [
    55.70000076,  51.40000153,  50.5       ,  75.69999695,
    58.40000153,  40.09999847,  61.5       ,  57.09999847,
    60.90000153,  66.59999847,  60.40000153,  68.09999847,
    66.90000153,  53.40000153,  48.59999847,  56.79999924,
    71.59999847,  58.40000153,  70.40000153,  41.20000076
]

employment = pd.Series(employment_values, index=countries)

def max_employment(employment):
    max_country = employment.idxmax()         # index label of the largest value
    max_value = employment.loc[max_country]   # look the value up by that label

    return (max_country, max_value)

max_employment(employment)

('Angola', 75.699996949999999)

### Filling Missing Values
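Adding two Series matches their elements by index label, so a label that appears in only one of them produces NaN. Write two versions of the code:
* add the two Series together and drop the NaN results, and
* add the two Series together, treating a value missing from either Series as 0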

In [9]:
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])
sum_result = s1 + s2

print("Drop:\n{}".format(sum_result.dropna()))
print("Fill:\n{}".format(s1.add(s2, fill_value=0)))

Drop:
c    13.0
d    24.0
dtype: float64
Fill:
a     1.0
b     2.0
c    13.0
d    24.0
e    30.0
f    40.0
dtype: float64
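
The reason the missing values appear in the first place is index alignment: + pairs elements by label, not by position. A minimal sketch with two made-up Series (x and y are hypothetical names) shows the raw result before dropping or filling:

```python
import pandas as pd

x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([10, 20], index=['b', 'c'])

# Only 'b' exists in both, so 'a' and 'c' come out as NaN
aligned = x + y
n_missing = aligned.isnull().sum()   # 2

# fill_value=0 substitutes 0 for the side that lacks a label
filled = x.add(y, fill_value=0)      # a: 1.0, b: 12.0, c: 20.0
```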


### Pandas Series apply()
apply() is called on a Series with a function as its argument. It calls the function on each element and returns a new Series of the results.
Here is an example:

In [10]:
names = pd.Series([
    'Andre Agassi',
    'Ian Clark',
    'Stephen Curry',
    'Kevin Durant',
    'Draymond Green',
    'Andre Iguodala',
    'Damian Jones',
    'Shaun Livingston',
    'Kevon Looney',
    'James-Michael McAdoo',
    'Patrick McCaw',
    'JaVale McGee',
    'Zaza Pachulia',
    'Klay Thompson',
    'Anderson Varejao',
    'David West',
])

def reverse_name(name):
    split_name = name.split(" ")
    first_name = split_name[0]
    last_name = split_name[1]
    return last_name + ", " + first_name

def reverse_names(names):
    return names.apply(reverse_name)

reverse_names(names)

0             Agassi, Andre
1                Clark, Ian
2            Curry, Stephen
3             Durant, Kevin
4           Green, Draymond
5           Iguodala, Andre
6             Jones, Damian
7         Livingston, Shaun
8             Looney, Kevon
9     McAdoo, James-Michael
10           McCaw, Patrick
11            McGee, JaVale
12           Pachulia, Zaza
13           Thompson, Klay
14        Varejao, Anderson
15              West, David
dtype: object
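
apply() is most useful when no built-in vectorized operation does the job, as with the name reversal above. When one does exist, both approaches give the same result, as this sketch with made-up values suggests:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

via_apply = s.apply(lambda x: x + 1)   # function called once per element
vectorized = s + 1                     # same result from a vectorized operation
print(via_apply.equals(vectorized))    # True

# Any one-argument function can be passed, including existing ones
upper = pd.Series(['a', 'b']).apply(str.upper)
```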