# pandas
good resources:
- Books
    - Python for Data Analysis, Wes McKinney
    - Learning the Pandas library, Matt Harrison
- Online resources
    - Stack Overflow
    - planetpython.org
- Podcasts
    - Python Bytes podcast
    - Data Skeptic podcast



In [1]:
import pandas as pd

## series data structure

In [2]:
# series are like a 2 column table, with an index column and a value column
# you can make a series out of a list
cars = ['focus', 'pilot', 'sierra']
car_series = pd.Series(cars)
# a series is an object; for text strings, the values are stored with the dtype object
print (car_series)
print ("----")

# you can store other data types in a series, and pandas will attempt to store them as a homogeneous data type
evens = [2, 4, 6]
evens_series = pd.Series(evens)
print (evens_series)
print ("----")

# as with numpy arrays, a pandas series will 'up-cast' types until it can accommodate all members as one data type
mixed = ['mouse', 2, 3.14 ]
mixed_series = pd.Series(mixed)
print (mixed_series) 

0     focus
1     pilot
2    sierra
dtype: object
----
0    2
1    4
2    6
dtype: int64
----
0    mouse
1        2
2     3.14
dtype: object
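
The up-casting described above can be checked by inspecting the `dtype` attribute of each series. The following is a minimal sketch; the variable names are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# integers only: stored with an integer dtype
ints = pd.Series([1, 2, 3])

# mixing integers and floats up-casts every value to float64
ints_and_floats = pd.Series([1, 2, 3.5])

# mixing strings and numbers falls back to the generic object dtype
mixed = pd.Series(['mouse', 2, 3.14])

print(ints.dtype)
print(ints_and_floats.dtype)
print(mixed.dtype)
```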


## pd handling of None values from python

In [3]:
# in a string (object) series, pandas keeps a python None as the None object; it prints as None but is not the string 'None'
# in a numeric series, pandas converts None to NaN (Not a Number), which is represented internally as a float, so the series becomes float64

text_none = pd.Series(['first', 'second', None])
print (text_none)
print ("----")
numeric_none = pd.Series ([1, 2, None])
print (numeric_none)

0     first
1    second
2      None
dtype: object
----
0    1.0
1    2.0
2    NaN
dtype: float64


In [4]:
import numpy as np
# in pandas, which is built on top of numpy,
# nan and None are not comparable using traditional boolean operators
print (np.nan == None)
print ("----")
# in fact, two instances of nan do not even compare as equal to each other
print (np.nan == np.nan)
print ("----")
# to test whether a value is nan you need a dedicated function such as numpy's isnan
print (np.isnan(np.nan))

False
----
False
----
True
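
Because `==` cannot detect missing values, pandas provides `pd.isna` and the `Series.isna` method, which treat both `None` and `NaN` as missing. A short sketch, reusing the two series from the earlier cell:

```python
import numpy as np
import pandas as pd

# pd.isna treats both None and NaN as missing
print(pd.isna(None))
print(pd.isna(np.nan))

# on a series, isna() returns a boolean series marking the missing entries
text_none = pd.Series(['first', 'second', None])
numeric_none = pd.Series([1, 2, None])
print(text_none.isna().tolist())
print(numeric_none.isna().tolist())
```

Unlike `np.isnan`, which raises a `TypeError` when given `None`, `pd.isna` accepts either kind of missing value.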


## creating series from real data

In [5]:
# series with an index of named data elements can be created directly from a python dictionary
sample_dict = {'snf': 15, 'home': 24, 'rehab': 13}
dict_series = pd.Series(sample_dict)
print(dict_series)
print ("----")

# a series and its index can also be created using the series 'index' parameter
index_series = pd.Series([3, 9, 64], index=['Knee', 'Manual', 'Mako'] )
print (index_series)
print ("----")

# you can store more complex datatypes in a series for example a series of tuples
tuple_series = [('Manual', 15), ('Mako', 30)]
print (tuple_series)
print(pd.Series(tuple_series))
print ("----")

# when both a dict and an index are given, dict keys missing from the index are dropped ('ray'), and index labels missing from the dict get NaN ('dan')
data_set = {'tom': 'developer', 'asif': 'manager', 'ray': 'tech lead'}
roles = pd.Series (data_set, index = ['tom', 'asif', 'dan'])
print (roles)

snf      15
home     24
rehab    13
dtype: int64
----
Knee       3
Manual     9
Mako      64
dtype: int64
----
[('Manual', 15), ('Mako', 30)]
0    (Manual, 15)
1      (Mako, 30)
dtype: object
----
tom     developer
asif      manager
dan           NaN
dtype: object
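
The alignment between the dictionary keys and the `index` argument can be verified directly. This sketch rebuilds the `roles` series from the cell above and inspects which labels were dropped and which are missing:

```python
import pandas as pd

data_set = {'tom': 'developer', 'asif': 'manager', 'ray': 'tech lead'}
roles = pd.Series(data_set, index=['tom', 'asif', 'dan'])

# 'ray' is in the dict but not in the index, so it is left out
print('ray' in roles.index)

# 'dan' is in the index but not in the dict, so its value is missing
print(roles[roles.isna()].index.tolist())

# dropna() returns a new series without the missing entries
print(roles.dropna())
```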


## Querying a series

In [6]:
# loc and iloc: query a series by index label using loc, and by integer position using iloc
patients = {
    "Frank": "Mako",
    "Amelia": "Manual",
    "Jeremy": "Mako"
    }
ps = pd.Series(patients)
print (ps)
print ("----")

# to select the nth item in the series (starting with 0) use the iloc attribute
print (ps.iloc[1])

# to select the item by named index, use the loc attribute
print (ps.loc['Jeremy'])

# if you leave out loc/iloc, pandas will try to work out whether a label or a position is being requested
# this usually works when the index labels are strings
# (newer pandas versions deprecate using a bare integer as a position here, so prefer iloc)
print (ps[1])
print (ps['Jeremy'])
# but it can produce wrong results if the integer passed in conflicts with the labels of a numeric index
# if the series has a numeric index (patient_id for example) it is safest to use loc and iloc explicitly


Frank       Mako
Amelia    Manual
Jeremy      Mako
dtype: object
----
Manual
Mako
Manual
Mako
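
The conflict mentioned in the comments above can be reproduced with a small series whose index labels are integers. This is a sketch with made-up patient ids, not data from the original notebook:

```python
import pandas as pd

# hypothetical patient ids used as index labels, deliberately out of order
by_id = pd.Series(['Frank', 'Amelia', 'Jeremy'], index=[2, 0, 1])

# loc looks up the label 0, which is the second row
print(by_id.loc[0])

# iloc looks up position 0, which is the first row (label 2)
print(by_id.iloc[0])

# with an integer index, plain [] is treated as a label lookup,
# so this matches loc rather than iloc
print(by_id[0])
```

Using `loc` or `iloc` explicitly makes it obvious to the reader which of the two lookups is intended.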


## performing operations on a series
Typical Python iteration works on a series, but it is not the most efficient approach. Pandas and the underlying NumPy library are optimized for vectorized operations, which are often many times faster than an equivalent Python loop.

In [7]:
# typical python iteration loop
# vs numpy vectorized function
import numpy as np

scores = pd.Series([10,20,15,35])
total = 0
for score in scores:
    total += score
avgscore = total/len(scores)
print(avgscore)

# this can be rewritten using the numpy.sum function as follows
total = np.sum(scores)
avgscore = total/len(scores)
print(avgscore)


20.0
20.0
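
A series also has its own vectorized aggregation methods, so the same average can be computed without calling NumPy directly. A minimal sketch using the same scores:

```python
import pandas as pd

scores = pd.Series([10, 20, 15, 35])

# Series.sum() and Series.mean() are vectorized, like np.sum
print(scores.sum())
print(scores.mean())
```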


## measuring speed and performance using %timeit

In [8]:
# we can perform a speed test to determine which technique is faster using a 'magic' function in the Jupyter notebook named timeit
# first we set up a suitably large test case
# The following creates a series of 10,000 random integers from 0 to 999 (the upper bound of randint is exclusive)
numbers = pd.Series(np.random.randint(0,1000,10000))
# to verify that the series is what we think it is, we can use head() to look at the first few, and len() to test the length
print(numbers.head())
print(len(numbers))


0    598
1    624
2     68
3    803
4    977
dtype: int64
10000


In [9]:
# magic functions are accessed by starting a line with a percentage sign
# cell magics start with two percentage signs and apply to all of the code in the current cell
# one such cell magic is timeit; it has to be the first line of the cell to work
# the -n parameter sets how many times the code is looped in each timing run
# in the next cell we time the iterative loop with 100 loops per run

In [10]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
avgscore = total/len(numbers)

4.18 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
# and next we will perform the same timing test using the vectorized numpy function  

In [12]:
%%timeit -n 100
total = np.sum(numbers)
avgscore = total/len(numbers)
# this was designed to show that np.sum is much faster; in the run recorded below on a Chromebook it was roughly 8 times faster, with noticeable run-to-run variance


The slowest run took 5.16 times longer than the fastest. This could mean that an intermediate result is being cached.
533 µs ± 275 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
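
Outside a Jupyter notebook the `%%timeit` magic is not available, but a similar comparison can be made with the standard library `timeit` module. This is a rough sketch; the function names `loop_mean` and `vectorized_mean` are made up for illustration, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # plain Python iteration over the series
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

def vectorized_mean():
    # the same calculation using the vectorized numpy sum
    return np.sum(numbers) / len(numbers)

# call each function 100 times and report the average time per call
loop_time = timeit.timeit(loop_mean, number=100) / 100
vector_time = timeit.timeit(vectorized_mean, number=100) / 100
print(f"loop:       {loop_time * 1e6:.0f} µs per call")
print(f"vectorized: {vector_time * 1e6:.0f} µs per call")
```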


## series operations (broadcasting)

In [13]:
# operations can be performed on all elements of a series
# for example in our numbers series from above, we can add 5 to each element
# this is called broadcasting
print(numbers.head())
print ("----")
numbers += 5
print(numbers.head())



0    598
1    624
2     68
3    803
4    977
dtype: int64
----
0    603
1    629
2     73
3    808
4    982
dtype: int64
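
Broadcasting is not limited to in-place addition. As a small sketch on a made-up series, other arithmetic and comparison operators behave the same way and return a new series:

```python
import pandas as pd

scores = pd.Series([10, 20, 15, 35])

# arithmetic with a scalar returns a new series; scores itself is unchanged
doubled = scores * 2
print(doubled.tolist())

# comparisons broadcast too and give a boolean series,
# which can be used to filter the original
high = scores[scores > 15]
print(high.tolist())
```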


You can achieve the same result through iteration, in particular by using the items() method on a pandas series (called iteritems() in older versions; that name was removed in pandas 2.0), for example

```for label, value in numbers.items(): ...``` 

but any time you are iterating in pandas, you should question whether you are doing the right thing. Most frequently there will be a way to do the same thing faster and more efficiently with a broadcast operation.
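
To make the comparison concrete, here is a minimal sketch that adds 5 to every element both ways and confirms the results match. It uses `Series.items()`, the name for label/value iteration in current pandas versions, and a small made-up series rather than the random one above.

```python
import pandas as pd

scores = pd.Series([10, 20, 15, 35], index=['a', 'b', 'c', 'd'])

# iterative version: visit each label/value pair in turn
looped = {}
for label, value in scores.items():
    looped[label] = value + 5
looped = pd.Series(looped)

# broadcast version: one expression, no explicit loop
broadcast = scores + 5

print(looped.tolist() == broadcast.tolist())
```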

## pandas indices
In a pandas series, index values do not need to be unique. This is quite different from a primary key in a relational database.
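
A short sketch of what a non-unique index looks like in practice, using made-up patient names: a lookup on a duplicated label returns every matching row, while a unique label returns a single value.

```python
import pandas as pd

# two patients named 'Frank' share the same index label
procedures = pd.Series(['Mako', 'Manual', 'Mako'],
                       index=['Frank', 'Amelia', 'Frank'])

# the index reports whether its labels are unique
print(procedures.index.is_unique)

# a duplicated label returns every matching row as a series
print(procedures.loc['Frank'])

# a unique label returns a single value
print(procedures.loc['Amelia'])
```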
