# Pandas Data Structures

## Creating your own data

### Series

A `Series` is a one-dimensional labelled container, similar to a Python `list`, except that all elements share a single `dtype`.

Each column of a `DataFrame` is a `Series`.
A `DataFrame` can be thought of as a dictionary of `Series` objects, where each key is a column name and each value is the `Series` holding that column.
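
A minimal sketch of this relationship, using made-up column names (`name`, `age`) and values that do not come from the examples in these notes: a `DataFrame` is built from a dict of `Series`, and one column is pulled back out as a `Series`.

```python
import pandas as pd

# Hypothetical example data; the names and values are illustrative only.
name_col = pd.Series(['Ada', 'Grace'])
age_col = pd.Series([36, 85])

# Dict keys become column names, dict values become the columns.
df = pd.DataFrame({'name': name_col, 'age': age_col})

# Selecting a single column returns a Series.
col = df['age']
print(type(col).__name__)   # Series
print(list(df.columns))     # ['name', 'age']
```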

In [23]:
import pandas as pd
import numpy as np
from collections import OrderedDict

In [5]:
s = pd.Series(['banana', 42])
s

0    banana
1        42
dtype: object
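
The `object` dtype appears here because a string and an integer share no more specific common type. As a small sketch (the variable names are illustrative), a list of integers gets an integer dtype, and mixing an integer with a float upcasts to `float64`:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])              # all integers -> an integer dtype
mixed_num = pd.Series([1, 2.5])          # int and float are upcast to float64
mixed_obj = pd.Series(['banana', 42])    # str and int fall back to object

print(ints.dtype)
print(mixed_num.dtype)   # float64
print(mixed_obj.dtype)   # object
```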

In [8]:
# Assign index labels manually by passing a Python list as `index`
s = pd.Series(['Wes McKinney', 'Creator of Pandas'], index=['Person', 'Who'])
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object

Question: What happens if a `tuple`, `dict`, or `numpy.ndarray` is passed instead of a `list`?

Answer: They all work. With a `dict`, the keys become the index labels.

In [13]:
# Tuple t
t = ('Wes McKinney', 'Creator of Pandas')
i = ('Person', 'Who')
s = pd.Series(t, index=i)
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object

In [12]:
# Dict d
d = {
    'Person': 'Wes McKinney',
    'Who': 'Creator of Pandas'
}
s = pd.Series(d)
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object

In [18]:
# Numpy ndarray n
l = ['Wes McKinney', 'Creator of Pandas']
n = np.array(l)
s = pd.Series(n, index=np.array(['Person', 'Who']))
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object

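Question: Does passing in an `index` when you use a `dict` overwrite the index? Or does it sort the values?

Answer: Neither, strictly speaking. Each label in `index` is looked up among the dict keys, so the values appear in the order given by `index`. A label that is not a key gets `NaN`, and keys that are not listed are dropped.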

In [20]:
# Dict d
d = {
    'Person': 'Wes McKinney',
    'Who': 'Creator of Pandas'
}
i = ['Who', 'Person']
s = pd.Series(d, index=i)
s

Who       Creator of Pandas
Person         Wes McKinney
dtype: object
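
To check what `index` does with a `dict`, the sketch below requests a label that is not a key of the dict (`'Missing'` is a made-up label). The values are matched to the requested labels rather than relabelled by position.

```python
import pandas as pd

d = {'Person': 'Wes McKinney', 'Who': 'Creator of Pandas'}

# 'Missing' is a hypothetical label that is not a key in d.
s = pd.Series(d, index=['Who', 'Missing'])

print(s['Who'])               # Creator of Pandas
print(pd.isna(s['Missing']))  # True: no matching key, so the value is missing
print('Person' in s.index)    # False: keys not listed in index are dropped
```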

### DataFrame

A `DataFrame` can be built from a dictionary in which each key is a column name and each value holds the contents of that column.

In [21]:
scientists = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'Age': [37, 61]
})
scientists

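|   | Name              | Occupation   | Born       | Died       | Age |
|---|-------------------|--------------|------------|------------|-----|
| 0 | Rosaline Franklin | Chemist      | 1920-07-25 | 1958-04-16 | 37  |
| 1 | William Gosset    | Statistician | 1876-06-13 | 1937-10-16 | 61  |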


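Notice: column order was not guaranteed on older Python and pandas versions, where a plain `dict` was unordered. On Python 3.7+ a `dict` keeps insertion order. Pass `columns=` to set the order explicitly.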

In [22]:
scientists = pd.DataFrame({
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'Age': [37, 61]
    },
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])
scientists

|                   | Occupation   | Born       | Died       | Age |
|-------------------|--------------|------------|------------|-----|
| Rosaline Franklin | Chemist      | 1920-07-25 | 1958-04-16 | 37  |
| William Gosset    | Statistician | 1876-06-13 | 1937-10-16 | 61  |


In [24]:
# Use an OrderedDict to fix the column order.
# OrderedDict(...) is called with a list of (column name, values) 2-tuples.
scientists = pd.DataFrame(
    OrderedDict([
        ('Name', ['Rosaline Franklin', 'William Gosset']),
        ('Occupation', ['Chemist', 'Statistician']),
        ('Born', ['1920-07-25', '1876-06-13']),
        ('Died', ['1958-04-16', '1937-10-16']),
        ('Age', [37, 61])
    ])
)
scientists

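|   | Name              | Occupation   | Born       | Died       | Age |
|---|-------------------|--------------|------------|------------|-----|
| 0 | Rosaline Franklin | Chemist      | 1920-07-25 | 1958-04-16 | 37  |
| 1 | William Gosset    | Statistician | 1876-06-13 | 1937-10-16 | 61  |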
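
On Python 3.7+ a plain `dict` preserves insertion order, so, as a sketch under that assumption, the same column order can be obtained without `OrderedDict`. Passing `columns=` still lets you choose an explicit order or a subset. The frame below is a cut-down copy of the example data.

```python
import pandas as pd

data = {
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Age': [37, 61],
}

df = pd.DataFrame(data)                                   # keeps dict insertion order
reordered = pd.DataFrame(data, columns=['Age', 'Name'])   # explicit order and subset

print(list(df.columns))         # ['Name', 'Occupation', 'Age']
print(list(reordered.columns))  # ['Age', 'Name']
```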


## Series

In [25]:
# create example dataframe
scientists = pd.DataFrame(
    data={
        'Occupation': ['Chemist', 'Statistician'],
        'Born': ['1920-07-25', '1876-06-13'],
        'Died': ['1958-04-16', '1937-10-16'],
        'Age': [37, 61]
    },
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age']
)
scientists

|                   | Occupation   | Born       | Died       | Age |
|-------------------|--------------|------------|------------|-----|
| Rosaline Franklin | Chemist      | 1920-07-25 | 1958-04-16 | 37  |
| William Gosset    | Statistician | 1876-06-13 | 1937-10-16 | 61  |


In [27]:
# select by row index label
first_row = scientists.loc['William Gosset']
type(first_row)

pandas.core.series.Series

In [28]:
first_row

Occupation    Statistician
Born            1876-06-13
Died            1937-10-16
Age                     61
Name: William Gosset, dtype: object

Selecting a single row by its index label with the `loc` attribute returns a `Series` object. Note that `'William Gosset'` is the second row of `scientists`, despite the variable name `first_row`.
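
As a brief sketch of the related selectors, using a cut-down copy of the `scientists` frame: `iloc` selects by integer position instead of label, and passing a list of labels to `loc` returns a `DataFrame` rather than a `Series`.

```python
import pandas as pd

scientists = pd.DataFrame(
    {'Occupation': ['Chemist', 'Statistician'], 'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
)

by_label = scientists.loc['William Gosset']      # Series, selected by label
by_position = scientists.iloc[1]                 # Series, selected by position
as_frame = scientists.loc[['William Gosset']]    # list of labels -> DataFrame

print(by_label.equals(by_position))   # True
print(type(as_frame).__name__)        # DataFrame
```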

In [29]:
first_row.index

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

In [30]:
first_row.values

array(['Statistician', '1876-06-13', '1937-10-16', 61], dtype=object)

In [31]:
# keys() is an alias for the index attribute
first_row.keys()

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

In [32]:
first_row.index[0]

'Occupation'

In [33]:
first_row.keys()[0]

'Occupation'

### `pandas.Series` is `numpy.ndarray`-like

A `pandas.Series` behaves much like a `numpy.ndarray` and is often referred to as a "vector".
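
A short sketch of what "vector" means in practice, rebuilding the `Age` values by hand: arithmetic and NumPy functions apply element-wise, and `to_numpy()` exposes the values as a NumPy array.

```python
import numpy as np
import pandas as pd

ages = pd.Series([37, 61], index=['Rosaline Franklin', 'William Gosset'], name='Age')

older = ages + 100        # the scalar is broadcast to every element
doubled = ages * 2        # element-wise multiplication
logged = np.log(ages)     # NumPy ufuncs accept a Series and return a Series

print(older.tolist())                    # [137, 161]
print(doubled.tolist())                  # [74, 122]
print(type(logged).__name__)             # Series
print(type(ages.to_numpy()).__name__)    # ndarray
```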

In [35]:
ages = scientists['Age']
ages

Rosaline Franklin    37
William Gosset       61
Name: Age, dtype: int64

In [36]:
ages.mean()

49.0

In [37]:
ages.min()

37

In [38]:
ages.max()

61

In [39]:
ages.std()

16.97056274847714

In [40]:
ages.describe()

count     2.000000
mean     49.000000
std      16.970563
min      37.000000
25%      43.000000
50%      49.000000
75%      55.000000
max      61.000000
Name: Age, dtype: float64

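In [56]:
# transposing a DataFrame returns a new object
scientists.transpose() is scientists

False

In [52]:
# assumes t = ages.transpose(), from a cell not shown here (an assumption);
# transposing a Series returns the Series itself
t is ages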

True

Some of the methods and attributes available on a `Series`:

| Method                       | Description                                                       |
|------------------------------|-------------------------------------------------------------------|
| append                       | Concatenates 2+ Series (removed in pandas 2.0; use `pd.concat`)   |
| corr                         | Calculates the correlation with another Series                    |
| cov                          | Calculates the covariance with another Series                     |
| describe                     | Calculates summary statistics                                     |
| drop_duplicates              | Returns a Series without duplicates                               |
| equals                       | Tests whether two Series contain the same elements                |
| values                       | Gets the values of the Series (an attribute, not a method)        |
| hist                         | Draws a histogram                                                 |
| min, max, mean, median, mode | Compute the corresponding statistic                               |
| quantile                     | Returns the value at a given quantile (0 <= q <= 1)               |
| replace                      | Replaces values in the Series with a specified value              |
| sample                       | Returns a random sample of values from the Series                 |
| sort_values                  | Sorts the values                                                  |
| to_frame                     | Converts the Series to a DataFrame                                |
| transpose                    | Returns the transpose (for a Series, the Series itself)           |
| unique                       | Returns a `numpy.ndarray` of unique values                        |
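
A few of these methods in action, as a sketch on a small made-up Series (the values are illustrative only). `pd.concat` is used for concatenation here because it works across pandas versions.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])      # hypothetical data
other = pd.Series([5, 4])

combined = pd.concat([s, other], ignore_index=True)  # concatenate two Series
print(combined.tolist())                 # [3, 1, 3, 2, 5, 4]
print(s.sort_values().tolist())          # [1, 2, 3, 3]
print(s.unique().tolist())               # [3, 1, 2]
print(s.drop_duplicates().tolist())      # [3, 1, 2]
print(s.quantile(0.5))                   # 2.5
print(s.replace(3, 0).tolist())          # [0, 1, 0, 2]
print(s.to_frame(name='x').shape)        # (4, 1)
```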

### Boolean Subsetting