# Python Pandas

The name is a portmanteau of "panel" and "data". Panel data is multi-dimensional data that varies over time. <br>

Pandas provides fast, flexible, and expressive data structures that make working with relational or labeled data easy. It relies heavily on NumPy and sits right on top of that package.

In [2]:
import numpy as np
import pandas as pd
from utils import render

### Exercise One - A Series

A __series__ is what you'd get if a Python dictionary and a NumPy array got together and had a kid: it inherits qualities of both. Like a NumPy array, it has a data type that you can specify. Like a dictionary, it maps keys to values. The keys are called labels, and together they form the index. Unlike the keys of a dictionary, Pandas labels don't have to be unique. Most methods also take missing data into account in their calculations.
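A minimal sketch of those last two points, using made-up labels and values (none of this data is from the exercises): a repeated label returns every matching entry, and reductions such as `sum` skip missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the label 'a' appears twice and one value is missing
s = pd.Series([1.0, 2.0, np.nan, 4.0], index=['a', 'b', 'a', 'c'], dtype='float64')

# A repeated label returns every matching entry as a Series
repeated = s['a']

# Reductions skip NaN by default (skipna=True)
total = s.sum()

# count() reports only the non-missing values
count = s.count()
```

Passing `skipna=False` to `sum` would let the missing value propagate instead.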

#### Creating a Series from a Python Dictionary
There are a couple of ways to create a new __series__ from scratch. The first example is below:

In [3]:
test_balance_data = {
    'pasan': 20.00,
    'treasure': 20.18,
    'ashley': 1.05,
    'craig': 42.42,
}

# The series constructor accepts any dict-like object
balances = pd.Series(test_balance_data)
balances

pasan       20.00
treasure    20.18
ashley       1.05
craig       42.42
dtype: float64

#### Creating a Series from an Iterable
I can also create a series from any iterable. __Note__ that when labels are not present, they default to incremental integers starting at 0. 

In [4]:
unlabeled_balances = pd.Series([20.00, 20.18, 1.05, 42.42])
unlabeled_balances

0    20.00
1    20.18
2     1.05
3    42.42
dtype: float64

I can also provide the `index=` argument, which requires an iterable the same size as the data. The labels are applied in the same order as the supplied index. 

In [5]:
labeled_balances = pd.Series(
    [20.00, 20.18, 1.05, 42.42],
    index=['pasan', 'treasure', 'ashley', 'craig']
)
labeled_balances

pasan       20.00
treasure    20.18
ashley       1.05
craig       42.42
dtype: float64

A NumPy array is also iterable, so I can create a new Series from an ndarray. 

In [6]:
ndbalances = np.array([20.00, 20.18, 1.05, 42.42])
pd.Series(ndbalances)

0    20.00
1    20.18
2     1.05
3    42.42
dtype: float64

Finally, I can create a Series from a scalar and an index. Because a scalar is a single value, it is broadcast to each of the labels specified in the `index` keyword argument.

In [7]:
pd.Series(20.00, index=["guil", "jay", "james", "ben", "nick"])

guil     20.0
jay      20.0
james    20.0
ben      20.0
nick     20.0
dtype: float64

__Note__ the keys are autogenerated when no index is specified. 

### Exercise Two - Accessing a Series

There are multiple ways for me to access the data. The `series` is indexed by username: the label is the username, and the value is that user's cash balance. A series is also ordered and zero-based, so I can access it by position just like I would a list or an array. 

In [8]:
balances

pasan       20.00
treasure    20.18
ashley       1.05
craig       42.42
dtype: float64

In [9]:
balances[0]

20.0

In [10]:
type(balances[0])

numpy.float64

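The value is a NumPy scalar (`numpy.float64`), so it keeps its data type and works with other NumPy data types and structures.<br>

Positional indexing works just like it does on a standard list. The indices start at 0, and negative numbers can be used to access values from the end. Note that recent pandas releases deprecate this integer fallback on a Series with non-integer labels, so `.iloc` (covered below) is the safer way to ask for a position. 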

In [11]:
# Accessing the last balance
balances[-1]

42.42

#### Accessing by Label
Since a series is labelled, I can access it like I would a dict:

In [12]:
balances['pasan']

20.0

In [15]:
for label, value in balances.items():
    render("The label {} has a value of {}".format(label, value))

The label pasan has a value of 20.0

The label treasure has a value of 20.18

The label ashley has a value of 1.05

The label craig has a value of 42.42

In [18]:
try:
    balances['kermit']
except KeyError:
    render('Accessing a non-existent key raises an error')

Accessing a non-existent key raises an error

In [21]:
if balances.get('kermit') is None:
    render('Use `get` to safely access keys. `None` is returned if key not present.')

Use `get` to safely access keys. `None` is returned if key not present.

In [23]:
if 'kermit' not in balances:
    render('Use `in` to test the existence of a label')

Use `in` to test the existence of a label
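As a small extension of the `get` example above, here is a sketch of passing a default value. The `0.0` default is my own choice for illustration.

```python
import pandas as pd

balances = pd.Series({
    'pasan': 20.00,
    'treasure': 20.18,
    'ashley': 1.05,
    'craig': 42.42,
})

# The second argument is returned when the label is missing
kermit_balance = balances.get('kermit', 0.0)

# An existing label ignores the default and returns the stored value
pasan_balance = balances.get('pasan', 0.0)
```

This avoids a separate `is None` check when a sensible fallback exists.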

#### Accessing by Property

As long as the label meets variable naming constraints and doesn't clash with an existing attribute or method, it will be available as a property via dot notation on the `series`. 

In [25]:
balances.ashley

1.05
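Dot notation has limits that bracket access does not. The sketch below uses hypothetical labels (`mean` and `first name` are not part of the balances data) to show a label that collides with a Series method and a label that isn't a valid identifier.

```python
import pandas as pd

# Hypothetical labels: one clashes with a Series method, one contains a space
s = pd.Series({'ashley': 1.05, 'mean': 2.0, 'first name': 3.0})

# Works: valid identifier and no clash with an existing attribute
by_attr = s.ashley

# This is the built-in Series.mean method, not the value stored under 'mean'
method = s.mean

# Bracket access always reaches the label
by_label = s['mean']

# Labels with spaces can only be reached with brackets
spaced = s['first name']
```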

#### Accessing More Explicitly w/  `loc` and `iloc`

So far, I've been able to use a label and a positional index to access the values I want. Because of the ambiguity between what is being used to access the values - either a label or a position - it is possible to be more explicit.

`.loc` looks up values by label only, while `.iloc` looks up values by positional index only. 


In [27]:
balances.loc['pasan']

20.0

In [29]:
balances.iloc[0]

20.0
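The ambiguity is easiest to see when the labels are themselves integers. This is a minimal sketch with a hypothetical Series (its labels 10, 20, and 30 are made up) where label and position disagree.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([1.5, 2.5, 3.5], index=[10, 20, 30])

# .loc treats 10 as a label
by_label = s.loc[10]

# .iloc treats 0 as a position
by_position = s.iloc[0]

# Negative positions count from the end, as with a list
last = s.iloc[-1]
```

Here `s.loc[0]` would raise a `KeyError`, because no label `0` exists.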

#### Accessing by Slice

A series also supports slices, which return portions of the data as a new Series. <br>

When __slicing by positional index__, the slice is exclusive: the last item is __not__ included. 

In [31]:
# includes values from zero
# up until **and not** including 3
balances.iloc[0:3]

pasan       20.00
treasure    20.18
ashley       1.05
dtype: float64

I can also slice by label, which is inclusive: the last item __is__ included. 

In [32]:
# Includes the values starting at 'pasan'
# up until **and** including 'ashley'
balances.loc['pasan':'ashley']

pasan       20.00
treasure    20.18
ashley       1.05
dtype: float64

### Exercise Three - Series Vectorization and Broadcasting

In [1]:
import pandas as pd

test_balance_data = {
    'pasan': 20.00,
    'treasure': 20.18,
    'ashley': 1.05,
    'craig': 42.42,
}

test_deposit_data = {
    'pasan': 20,
    'treasure': 10,
    'ashley': 100,
    'craig': 55,   
}

balances = pd.Series(test_balance_data)
deposits = pd.Series(test_deposit_data)

__Vectorization__ lets us avoid writing everything as a loop and instead apply operations uniformly across a vector. It's faster and easier to read and write. This is especially important to note when working with data like the CSVs I'm working on with the II. <br>

Below is an example of a for loop that goes through each deposit and adds it to the matching balance:

In [2]:
for label, value in deposits.items():
    balances[label] += value
balances

pasan        40.00
treasure     30.18
ashley      101.05
craig        97.42
dtype: float64

Below is the __exact same operation__, this time applying the vectorization principles. 

In [11]:
# Undo the change using inplace subtraction
balances -= deposits

# This is the same as the loop above using inplace addition
balances += deposits
balances

pasan        40.00
treasure     30.18
ashley      101.05
craig        97.42
dtype: float64
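To check that the loop and the vectorized form really agree, here is a small sketch that runs both on copies of the same data and compares the results. The names `looped`, `vectorized`, and `same` are mine.

```python
import pandas as pd

balances = pd.Series({'pasan': 20.00, 'treasure': 20.18, 'ashley': 1.05, 'craig': 42.42})
deposits = pd.Series({'pasan': 20, 'treasure': 10, 'ashley': 100, 'craig': 55})

# Loop version: work on a copy so the original balances stay untouched
looped = balances.copy()
for label, value in deposits.items():
    looped[label] += value

# Vectorized version: one expression, returns a new Series
vectorized = balances + deposits

# equals() compares labels, values, and dtype
same = looped.equals(vectorized)
```

Unlike `+=`, the plain `+` leaves `balances` unchanged, which makes it easier to rerun a cell without double-counting deposits.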

What happens when the number of values matches up, but the indexes are out of whack?

In [12]:
totals = {
    'mario': 135,
    'peach': 149,
    'yoshi': 122,
}
final = {
    'peach': 45,
    'mario': 63,
    'yoshi': 77,
}
total_laps = pd.Series(totals)
final_lap = pd.Series(final)
total_laps + final_lap

mario    198
peach    194
yoshi    199
dtype: int64

The labels will line up and the vectorization will take care of element to element addition :) 

__Broadcasting__ in Pandas works the same way: the arithmetic operators are overloaded to use the vectorized versions of the same operation. This overloading is why Pandas can do what it does without breaking the rules of Python. 

#### Broadcasting a Scalar

In [4]:
# The integer 5 is broadcast and added to each and every value.
# This returns a new series.
balances + 5

pasan        45.00
treasure     35.18
ashley      106.05
craig       102.42
dtype: float64

#### Broadcasting a Series
Labels are used to line up entries. When a label exists on one side only, `np.nan` (meaning "not a number") is put in its place. 

In [5]:
coupons = pd.Series(1, ['craig', 'ashley', 'james'])
coupons

craig     1
ashley    1
james     1
dtype: int64

Now I'll broadcast that new value and add it to the balance sheet. <br>
Note that pasan and treasure aren't included in the coupons list. See what happens to their entries below. Also note that James __was__ included on the coupons list, but he's not in the original balances list. See what happens to him as well.

In [6]:
# Returns a new series
balances + coupons

ashley      102.05
craig        98.42
james          NaN
pasan          NaN
treasure       NaN
dtype: float64

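James, Treasure, and Pasan each had a value on one side of the addition, yet all three came back as `NaN` because their labels were missing on the other side. This demonstrates the limits of the plain `+` operator, but those can be rectified using the `fill_value` parameter, shown below. 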

#### Using the `fill_value` parameter
The `fill_value` argument fills in missing values so that everything aligns. The idea is to call the `add` method directly and pass it the keyword argument `fill_value`. 

In [8]:
# Returns a new series
balances.add(coupons, fill_value=0)

ashley      102.05
craig        98.42
james         1.00
pasan        40.00
treasure     30.18
dtype: float64

Notice that I'm replacing a basic Python operator (+, -, etc.) with a method call (`add`, `subtract`, etc.) that can take additional arguments. 
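The same pattern applies to the other arithmetic operators. The sketch below uses `mul` with hypothetical data (the `bonus_rate` Series is made up), where a neutral `fill_value` of 1 makes more sense than 0.

```python
import pandas as pd

balances = pd.Series({'pasan': 40.00, 'ashley': 100.00})

# Hypothetical multipliers; 'james' has no balance and 'pasan' has no rate
bonus_rate = pd.Series({'ashley': 1.5, 'james': 2.0})

# Operator form: a label missing on either side produces NaN
with_operator = balances * bonus_rate

# Method form: a missing entry is treated as 1 before multiplying
with_method = balances.mul(bonus_rate, fill_value=1)
```

The right `fill_value` depends on the operation: 0 leaves a sum unchanged, while 1 leaves a product unchanged.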

In [9]:
remaining = {
    'mario': 3,
    'peach': 2,
    'yoshi': 2,
}
completed = {
    'peach': 1,
    'bowser': 2,
}
remaining_laps = pd.Series(remaining)
completions = pd.Series(completed)
remaining_laps - completions

bowser    NaN
mario     NaN
peach     1.0
yoshi     NaN
dtype: float64

Obviously, I want to keep the information from both Series, so instead of using `remaining_laps - completions`, I'll correct this by using the vectorized `subtract` method with `fill_value=0`.

In [10]:
# Returns a new series
remaining_laps.subtract(completions, fill_value=0)

bowser   -2.0
mario     3.0
peach     1.0
yoshi     2.0
dtype: float64

### Exercise Four - DataFrames

A DataFrame is basically just a two-dimensional collection of Series. There are many ways of creating a `DataFrame` from existing objects, like a series.
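As a rough illustration of the "collection of Series" idea, this sketch builds a DataFrame from two Series that share the same labels. The variable names are my own.

```python
import pandas as pd

# Each column starts life as its own Series, sharing the same labels
first_names = pd.Series({'craigsdennis': 'Craig', 'treasure': 'Treasure'})
balances = pd.Series({'craigsdennis': 42.42, 'treasure': 25.00})

# A dict of Series becomes a DataFrame: keys are column names, labels are aligned
users = pd.DataFrame({'first_name': first_names, 'balance': balances})

# Selecting a single column hands back a Series
balance_column = users['balance']
```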

#### Creating a DataFrame from a 2-Dimensional Object
If my data is already in rows and columns, like a list of lists, I can just pass the data along to the constructor. Labels and column headings will automatically be generated as a range. Very easy!

In [13]:
test_users_list = [
    ['Craig', 'Dennis', 42.42],
    ['Treasure', 'Porth', 25.00]
]

pd.DataFrame(test_users_list)

          0       1      2
0     Craig  Dennis  42.42
1  Treasure   Porth  25.00


In the above example, both the labels and column headings are autogenerated. I can specify the `index` and `columns`, shown below:

In [14]:
pd.DataFrame(test_users_list, index=['craigsdennis', 'treasure'],
            columns=['first_name', 'last_name', 'balance'])

              first_name  last_name  balance
craigsdennis       Craig     Dennis    42.42
treasure        Treasure      Porth    25.00


#### Creating a DataFrame from a Dictionary
As with a `Series`, if I don't specify the index, it'll be autogenerated as a range. 

In [16]:
# The Default expected Dictionary layout is column name to ordered values
test_user_data = {
    'first_name': ['Craig', 'Treasure'],
    'last_name': ['Dennis', 'Porth'],
    'balance': [42.42, 25.00]
}

pd.DataFrame(test_user_data)

   first_name  last_name  balance
0       Craig     Dennis    42.42
1    Treasure      Porth    25.00


Sometimes I might not want integers as my index, so I can specify labels by supplying the `index=['string','string']` keyword argument.

In [17]:
pd.DataFrame(test_user_data, index=['craigsdennis', 'treasure'])

              first_name  last_name  balance
craigsdennis       Craig     Dennis    42.42
treasure        Treasure      Porth    25.00


#### Adding More Options w/  `DataFrame.from_dict`
The `orient` keyword allows me to specify whether the keys of my dictionary become the labels (`index`) or the column titles (`columns`). <br>

NOTE: Here the keys of the nested dictionaries define the columns. I could also pass a list to the `columns` argument. 

In [18]:
by_username = {
    'craigsdennis': {
        'first_name': 'Craig',
        'last_name': 'Dennis',
        'balance': 42.42
    },
    'treasure': {
        'first_name': 'Treasure',
        'last_name': 'Porth',
        'balance': 25.00
    }
}

pd.DataFrame.from_dict(by_username, orient='index')

              first_name  last_name  balance
craigsdennis       Craig     Dennis    42.42
treasure        Treasure      Porth    25.00
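The note about passing a list to `columns` can be sketched as follows. This assumes the values are plain lists rather than nested dictionaries, and `by_username_lists` is a hypothetical variant of the data above.

```python
import pandas as pd

# Hypothetical variant: usernames map to plain lists, so column names must be supplied
by_username_lists = {
    'craigsdennis': ['Craig', 'Dennis', 42.42],
    'treasure': ['Treasure', 'Porth', 25.00],
}

# orient='index': dictionary keys become row labels, `columns` names each list position
users = pd.DataFrame.from_dict(
    by_username_lists,
    orient='index',
    columns=['first_name', 'last_name', 'balance'],
)

# orient='columns' (the default): dictionary keys become column headings instead
transposed = pd.DataFrame.from_dict(by_username_lists, orient='columns')
```

With the default orientation each username becomes a column, which is rarely what I want for per-user records.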
