[Nbviewer](http://nbviewer.jupyter.org/github/nuclth/Python_Statistics/blob/master/Intro_to_Pandas.ipynb)

**Last Edited**: 2017-10-05 11:47:23 

# Understanding Data Structures

This notebook is a lightning introduction to working with Pandas data structures. It is heavily based on the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro).


## Data structures at a glance

In [177]:
import pandas as pd
import numpy as np

The table below is taken from the [Pandas package description](https://pandas.pydata.org/pandas-docs/stable/overview.html).

|Dimensions |Name     |Description|
|:-----     |:-----   |:-----:|
|1          |Series   | 1D labeled homogeneously-typed array |
|2          |DataFrame| General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns |
|3          |Panel    | General 3D labeled, also size-mutable array |

Note that data alignment here is always maintained unless explicitly broken (examples below).

## Series

[Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series)

A Series is a 1-dimensional labeled array whose values share a single type. It can be created from **3** different types of data input:

|Data call|Example|
|:---|:---|
|**scalar**| `pd.Series(5, index=range(5))`|
|**Dict**| `pd.Series({1: 50, 2: 100})`|
|**Ndarray**| `pd.Series(np.random.randn(5))`|

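If no index is specified, a scalar produces a Series of length 1. Elements can be accessed in the usual way. Take `s = pd.Series(1, index=['a', 'b'])`: in the pandas version used here, `s[0]` and `s['a']` are both valid calls (array-like and dictionary-like respectively). Newer pandas releases deprecate positional `s[0]` on a labeled index, so `s.iloc[0]` is the safer spelling for access by position.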
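As a minimal sketch of the three constructors and the two access styles described above, the cell below uses the explicit `.loc` and `.iloc` accessors. The variable names (`from_scalar`, `by_label`, and so on) are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# The three constructor inputs from the table above.
from_scalar = pd.Series(5, index=range(5))   # scalar broadcast over the index
from_dict = pd.Series({1: 50, 2: 100})       # dict keys become the index labels
from_array = pd.Series(np.random.randn(5))   # default integer index 0..4

# With no index, a scalar yields a Series of length 1.
single = pd.Series(5)

# Explicit accessors avoid ambiguity between labels and positions.
s = pd.Series(1, index=['a', 'b'])
by_label = s.loc['a']      # dictionary-like, by label
by_position = s.iloc[0]    # array-like, by integer position
```

Both `by_label` and `by_position` refer to the same element here, since `'a'` is the label at position 0.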

In [178]:
s1 = pd.Series(range(5))
s1

0    0
1    1
2    2
3    3
4    4
dtype: int64

The example below illustrates data alignment. When two Series are added, values are matched by label rather than by position. Any label that is missing from either Series yields `NaN` in the result. If the two indices shared no labels at all (e.g., if one Series were indexed by `a b c d e`), every value would be `NaN`.

In [179]:
s2 = pd.Series(range(3))
s1[1:] + s2[:4]

0    NaN
1    2.0
2    4.0
3    NaN
4    NaN
dtype: float64
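The alignment behaviour can be explored a little further. The sketch below repeats the sum with explicit `.iloc` slices, then shows `Series.add` with `fill_value` as one way to avoid `NaN`, and finally a sum over two fully disjoint indices. The Series `left` and `right` and their labels are made up for illustration.

```python
import pandas as pd

s1 = pd.Series(range(5))
s2 = pd.Series(range(3))

# Labels 1 and 2 are shared; labels 0, 3 and 4 are missing from one operand.
aligned = s1.iloc[1:] + s2.iloc[:4]

# Treat a label missing from one side as 0 instead of propagating NaN.
filled = s1.iloc[1:].add(s2.iloc[:4], fill_value=0)

# Completely disjoint labels: every entry of the sum is NaN.
left = pd.Series(range(3), index=list('abc'))
right = pd.Series(range(3), index=list('xyz'))
disjoint = left + right
```

The result of a sum is indexed by the union of the two indices, which is why `disjoint` has six rows rather than three.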

## DataFrame

[Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

DataFrames can be created from many different inputs. Below I list **6** different constructions explicitly.

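### Dict of Series or Dicts

If a Series is given no index, its index starts from zero; the row index of the DataFrame is the union of the individual Series indices. If `columns` is unspecified, the columns are the keys of the dictionary. A requested column with no data (here `three`) is filled with `NaN`.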

In [180]:
pd.DataFrame({'one' : pd.Series([1.]), 'two' : pd.Series([1., 2.], index=['a', 'b'])}, columns = ['one', 'two', 'three'])

| |one|two|three|
|:---|:---|:---|:---|
|0|1.0|NaN|NaN|
|a|NaN|1.0|NaN|
|b|NaN|2.0|NaN|


### Dict of ndarrays/lists

The arrays or lists must all be the same length for this construction.

In [181]:
pd.DataFrame({'one' : [1., 2., 3.], 'two' : [3., 2., 1.]})

| |one|two|
|:---|:---|:---|
|0|1.0|3.0|
|1|2.0|2.0|
|2|3.0|1.0|
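To make the length requirement concrete, the sketch below passes lists of unequal length and catches the resulting `ValueError`. It then shows one possible workaround, wrapping each list in a Series so that the shorter column is padded with `NaN`. The names `raised` and `padded` are illustrative.

```python
import pandas as pd

# Lists of different lengths are rejected by the constructor.
raised = False
try:
    pd.DataFrame({'one': [1., 2., 3.], 'two': [3., 2.]})
except ValueError:
    raised = True

# Wrapping each list in a Series sidesteps this: rows align on the index
# and the shorter column is padded with NaN.
padded = pd.DataFrame({'one': pd.Series([1., 2., 3.]),
                       'two': pd.Series([3., 2.])})
```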


### List of Dicts

In [182]:
pd.DataFrame ([{'a': 1}, {'a': 5, 'b': 10}], index = ['Y', 'Z'])

| |a|b|
|:---|:---|:---|
|Y|1|NaN|
|Z|5|10.0|


### Series

In [183]:
s3 = pd.Series(['A','B'])
s4 = pd.Series(['C','D'])
pd.DataFrame([s3,s4])

| |0|1|
|:---|:---|:---|
|0|A|B|
|1|C|D|
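Note that a list of Series is stacked row by row, as the output above shows. If the Series should become columns instead, `pd.concat` with `axis=1` is one option. The sketch below contrasts the two; the Series names `left` and `right` are hypothetical and chosen only to make the resulting labels visible.

```python
import pandas as pd

s3 = pd.Series(['A', 'B'], name='left')
s4 = pd.Series(['C', 'D'], name='right')

# A list of Series: each Series becomes one row, labeled by its name.
as_rows = pd.DataFrame([s3, s4])

# Concatenating along axis=1: each Series becomes one column instead.
as_columns = pd.concat([s3, s4], axis=1)
```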


### Dict of Tuples

This is one way to create multi-index DataFrames.

In [184]:
pd.DataFrame({('A', 'a1'): {('Y', 'y1'): 1, ('Y', 'y2'): 2},
              ('A', 'a2'): {('Y', 'y1'): 3, ('Y', 'y2'): 4},
              ('A', 'a3'): {('Z', 'z1'): 5, ('Z', 'z2'): 6},
              ('B', 'b1'): {('Z', 'z1'): 7, ('Z', 'z2'): 8},
              ('B', 'b2'): {('Z', 'z1'): 9, ('Z', 'z2'): 10}})

| | |A, a1|A, a2|A, a3|B, b1|B, b2|
|:---|:---|:---|:---|:---|:---|:---|
|Y|y1|1.0|3.0|NaN|NaN|NaN|
|Y|y2|2.0|4.0|NaN|NaN|NaN|
|Z|z1|NaN|NaN|5.0|7.0|9.0|
|Z|z2|NaN|NaN|6.0|8.0|10.0|


The construction is somewhat confusing. Each outer key is a tuple holding the column labels: the top-level column first, then the nested column. Each inner dictionary maps a tuple of row labels (top-level row, then nested row) to the value stored in that cell. For example, the first line of the code above reads:

* top column `A` with nested column `a1` has one element at top row `Y`, nested row `y1` (1.0), and one at top row `Y`, nested row `y2` (2.0)
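Reading values back out of such a frame follows the same tuple convention. The sketch below builds a smaller two-column version of the frame above (the name `df_multi` is illustrative) and selects a nested column, a top-level block, and a single cell.

```python
import pandas as pd

df_multi = pd.DataFrame({('A', 'a1'): {('Y', 'y1'): 1, ('Y', 'y2'): 2},
                         ('B', 'b1'): {('Z', 'z1'): 7, ('Z', 'z2'): 8}})

# One nested column, addressed by its full (top, nested) tuple -> Series.
col = df_multi[('A', 'a1')]

# Everything under top-level column A -> DataFrame with the nested labels.
block = df_multi['A']

# A single cell: full row tuple first, then full column tuple.
cell = df_multi.loc[('Y', 'y1'), ('A', 'a1')]
```

Rows that have no entry for a given column, such as the `Z` rows under `('A', 'a1')`, hold `NaN`.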

### Structured Array

Handled identically to a dict of arrays.

In [185]:
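# 'U10' declares a unicode string field of up to 10 characters
# (the older 'a10' bytes code is not accepted by recent NumPy releases)
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'U10')])
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]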

pd.DataFrame(data, index = ['first','second'], columns=['C','B','A'])

| |C|B|A|
|:---|:---|:---|:---|
|first|Hello|2.0|1|
|second|World|3.0|2|


### Dataframe Manipulations

In [186]:
df = pd.DataFrame({'one' : [1., 2., 3.], 'two' : [3., 2., 1.]}); df

| |one|two|
|:---|:---|:---|
|0|1.0|3.0|
|1|2.0|2.0|
|2|3.0|1.0|


Access is done in the usual way

In [187]:
df['one'][1]

2.0

Nested (multi-index) columns are addressed with commas, e.g., `df['A', 'a1']`. Other ways of accessing data are given in the table below.

|Operation |Syntax |Result|
|:---|:---|:---|
|Select column | `df[col]` | Series|
|Select row by label | `df.loc[label]` | Series|
|Select row by integer location | `df.iloc[loc]` | Series|
|Slice rows | `df[5:10]` | DataFrame|
|Select rows by boolean vector | `df[bool_vec]` | DataFrame|
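Each row of the table can be tried on a small frame. The sketch below assumes a frame like the one above but with hypothetical string row labels `x`, `y`, `z`, so that label-based and position-based selection are easy to tell apart.

```python
import pandas as pd

df_access = pd.DataFrame({'one': [1., 2., 3.], 'two': [3., 2., 1.]},
                         index=['x', 'y', 'z'])

column = df_access['one']                   # select column -> Series
row_by_label = df_access.loc['y']           # select row by label -> Series
row_by_position = df_access.iloc[1]         # select row by integer location -> Series
row_slice = df_access[0:2]                  # slice rows -> DataFrame
masked = df_access[df_access['one'] > 1]    # boolean vector -> DataFrame
```

Here `row_by_label` and `row_by_position` pick out the same row, since `'y'` sits at position 1.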

New columns are created by assignment, `df['A'] = values`, where `df` is the dataframe name and `A` is the new column name. The `del` command deletes a column. Columns can also be inserted at a particular position with the `.insert` method; values are aligned on the index, so rows with no matching label receive `NaN`.

In [188]:
df['three'] = df['one'] * df['two']
del df['two']
df.insert(1, 'bar', df['one'][:2])
df

| |one|bar|three|
|:---|:---|:---|:---|
|0|1.0|1.0|3.0|
|1|2.0|2.0|4.0|
|2|3.0|NaN|3.0|


This can also be accomplished with the `.assign` command (split into two different commands here)

In [189]:
df4 = df.assign(four = df.three/df.one).assign(five = lambda q: q.three * q.bar)
df4

| |one|bar|three|four|five|
|:---|:---|:---|:---|:---|:---|
|0|1.0|1.0|3.0|3.0|3.0|
|1|2.0|2.0|4.0|2.0|8.0|
|2|3.0|NaN|3.0|1.0|NaN|


Note that `.assign` returns a new dataframe and the original dataframe is unchanged!

In [190]:
df

| |one|bar|three|
|:---|:---|:---|:---|
|0|1.0|1.0|3.0|
|1|2.0|2.0|4.0|
|2|3.0|NaN|3.0|


The `.query` method selects only the rows of a dataframe that satisfy a condition, written as a string expression over the column names:

In [191]:
df4.query('four < 3')

| |one|bar|three|four|five|
|:---|:---|:---|:---|:---|:---|
|1|2.0|2.0|4.0|2.0|8.0|
|2|3.0|NaN|3.0|1.0|NaN|
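For comparison, the same selection can be written with a boolean vector, as listed in the access table earlier. The sketch below rebuilds only the two columns needed (the frame name `df_q` is illustrative) and checks that both spellings return the same rows.

```python
import pandas as pd

df_q = pd.DataFrame({'three': [3., 4., 3.], 'four': [3., 2., 1.]})

# String expression evaluated against the column names.
via_query = df_q.query('four < 3')

# Equivalent boolean-vector selection.
via_mask = df_q[df_q['four'] < 3]

same_rows = via_query.equals(via_mask)
```

Which form to use is largely a matter of taste: `.query` tends to read more cleanly for long conditions, while the boolean mask works with arbitrary Python expressions.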


## Panel

Panel was deprecated and has since been removed from pandas (as of version 0.25). Therefore, I do not cover it.

# Resources

[Package Description](https://pandas.pydata.org/pandas-docs/stable/overview.html)