In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML

# 2 - Data structures

## Outline

Goal: *Provide an overview of the main data structures in pandas.*

Key topics:

- Data structures overview
- Series
- DataFrame
- Index
- NDFrame vs ndarray

## Data structures overview

Pandas makes use of the following main data structures:

- **Series**: 1D container for data
- **DataFrame**: 2D container for Series
- **Index**: holds labels and indices

Key properties:
- data alignment is intrinsic unless you explicitly override it
- missing data is handled automatically
- existing NumPy functions can be applied directly
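
The three properties above can be seen in a minimal sketch. The series `a` and `b` and their labels are made up for illustration:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Alignment: values are matched on index labels, not on position.
total = a + b

# Missing data: 'x' has no partner in b, so the result is NaN there,
# and reductions such as sum() skip NaN by default.
total_sum = total.sum()

# NumPy functions apply elementwise and keep the index.
roots = np.sqrt(a)
```

Here `total` holds `NaN` for `'x'`, and `roots` is still a Series carrying the labels of `a`.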

## Series

One-dimensional ndarray with axis labels (1D subclass of NDFrame).

Key properties:

- homogeneously-typed; single data type
- value-mutable, but size-immutable
- operations align on index values

The Series object:

    pd.Series(data=None, index=None, dtype=None, name=None, copy=False)

- **data** : array-like, dict or scalar data
- **index** : labels, which must be hashable and the same length as data but need not be unique; defaults to a `RangeIndex` (0, 1, ..., n-1)
- **dtype** : specify `numpy.dtype` or set to `None` to auto infer
- **name**: name of the series, similar to DataFrame column name
- **copy** : copy input data, by default `False`, note that this only affects ndarray based input
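
A small sketch of how these parameters interact, using made-up labels and values:

```python
import pandas as pd

# Scalar data is repeated for every label in the index.
s_scalar = pd.Series(0.5, index=['a', 'b'], name='weight')

# An explicit dtype overrides inference (these ints would otherwise be inferred as an integer dtype).
s_float = pd.Series([1, 2, 3], dtype='float32')

# With a dict and an explicit index, the index selects and orders the keys;
# labels that are missing from the dict become NaN.
s_dict = pd.Series({'a': 1, 'b': 2}, index=['b', 'c'])
```

Note that `s_dict` ends up as `float64`, because an integer Series cannot hold `NaN`.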

Examples of creating a Series object:

In [2]:
se = pd.Series({'b1': True, 'b2': False, 'b3': False}, name='values')  # from dict
se

b1     True
b2    False
b3    False
Name: values, dtype: bool

In [3]:
se = pd.Series(np.arange(3), index=['a', 'b', 'c'], dtype=int)  # from ndarray
se

a    0
b    1
c    2
dtype: int64

A Series is basically an `index` and a set of `values`:

In [4]:
se = pd.Series([1., 1.5, 2., 2.5])
se

0    1.0
1    1.5
2    2.0
3    2.5
dtype: float64

In [5]:
se.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
se.values

array([1. , 1.5, 2. , 2.5])

A Series behaves much like an `ndarray`:

In [7]:
se = pd.Series(5., index=['a', 'b', 'c'])
se

a    5.0
b    5.0
c    5.0
dtype: float64

In [8]:
se.iloc[0]  # positional (integer) indexing

5.0

In [9]:
np.log(se ** 2)  # elementwise operations

a    3.218876
b    3.218876
c    3.218876
dtype: float64

except for the automatic alignment based on index labels:

In [10]:
se[:2] + se[1:]

a     NaN
b    10.0
c     NaN
dtype: float64

the `ndarray` version would give:

In [11]:
se[:2].values + se[1:].values

array([10., 10.])
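
If the `NaN` results from non-overlapping labels are not wanted, one option is the method form of the operator with a `fill_value`. This is a sketch of one possible approach, not the only one:

```python
import pandas as pd

se = pd.Series(5., index=['a', 'b', 'c'])
left = se.iloc[:2]   # labels 'a', 'b'
right = se.iloc[1:]  # labels 'b', 'c'

# Treat a label missing on one side as 0 instead of producing NaN.
filled = left.add(right, fill_value=0)
```

The result keeps all three labels, with `'a'` and `'c'` taken from the side that has them.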

A Series also behaves much like a dict:

In [12]:
se['a']

5.0

In [13]:
se.keys()

Index(['a', 'b', 'c'], dtype='object')

In [14]:
se.get('c')

5.0

In [15]:
se.get('d')  # missing label: returns None, so nothing is displayed
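
As with a dict, `get` accepts a default and `in` tests the labels. A minimal sketch:

```python
import pandas as pd

se = pd.Series(5., index=['a', 'b', 'c'])

missing = se.get('d')        # None: no such label, and no error is raised
fallback = se.get('d', 0.0)  # explicit default for a missing label
has_a = 'a' in se            # membership tests the index labels, not the values
```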

Other ways of accessing the data:

In [16]:
pop = pd.Series({'DE': 81.3, 'BE': 11.3, 'FR': 64.3, 
                 'UK': 64.9, 'NL': 16.9})
pop

DE    81.3
BE    11.3
FR    64.3
UK    64.9
NL    16.9
dtype: float64

In [17]:
pop['BE':'UK']  # label slicing includes both endpoints

BE    11.3
FR    64.3
UK    64.9
dtype: float64

In [18]:
pop[pop < 30]  # boolean indexing

BE    11.3
NL    16.9
dtype: float64

In [19]:
pop[['DE', 'NL']]  # list of labels

DE    81.3
NL    16.9
dtype: float64
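
The explicit `.loc` (label-based) and `.iloc` (position-based) accessors make the intent unambiguous. A short sketch on the same data:

```python
import pandas as pd

pop = pd.Series({'DE': 81.3, 'BE': 11.3, 'FR': 64.3,
                 'UK': 64.9, 'NL': 16.9})

by_label = pop.loc['BE':'UK']  # label slice, both endpoints included
by_position = pop.iloc[1:3]    # positional slice, end excluded
```

The label slice returns three entries (BE, FR, UK), while the positional slice returns two (BE, FR).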

## DataFrame

Two-dimensional NDFrame with both row indices and column labels; a dict-like container for Series objects. 

Key properties:

- heterogeneously-typed; differently typed columns
- value- and size-mutable (especially in the column dimension)
- arithmetic operations align on both row and column labels
- the most commonly used pandas object
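
A minimal sketch of size mutability and two-way alignment, using two small made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r0', 'r1'])
df2 = pd.DataFrame({'b': [10, 20], 'c': [30, 40]}, index=['r1', 'r2'])

# Size-mutable: a new column can be added in place.
df1['total'] = df1['a'] + df1['b']

# Arithmetic aligns on row and column labels; only ('r1', 'b') exists in both.
summed = df1 + df2
```

`summed` covers the union of the row and column labels, and every cell except `('r1', 'b')` is `NaN`.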

The DataFrame object:
```python
pd.DataFrame(data=None, index=None, columns=None, 
             dtype=None, copy=False)
```
- **data** : array-like, dict or scalar data
- **index** : row labels, which must be hashable and the same length as data but need not be unique; defaults to a `RangeIndex` (0, 1, ..., n-1)
- **columns** : similar to index, but then in column direction
- **dtype** : specify `numpy.dtype` or set to `None` to auto infer
- **copy** : copy input data, by default `False`, note that this only affects ndarray based input

Examples of creating a DataFrame object:

In [20]:
df = pd.DataFrame(np.ones((2,3)), columns=['a', 'b', 'c'])
df

     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0


In [21]:
df = pd.DataFrame({'col_1':[1,2,3], 'col_2':['a','b','c']}, 
                  index=['row_1','row_2','row_3'])
df

       col_1 col_2
row_1      1     a
row_2      2     b
row_3      3     c


In [22]:
df.dtypes

col_1     int64
col_2    object
dtype: object

### [dtypes](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes)

The main types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category and object. In addition, the numeric dtypes come in several item sizes, e.g. int64 and int32.
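
A short sketch of working with dtypes and item sizes. The frame and column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [0.5, 1.5, 2.5]})

# Same kind of number, smaller item size (4 bytes per value).
small = df['col_1'].astype('int32')

# Select columns by dtype.
floats = df.select_dtypes(include=['float64'])
```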

A DataFrame is made up of two indexes and a set of values:

In [23]:
df = pd.DataFrame(np.random.randint(10, size=(3,3)), 
                  index=['r0', 'r1', 'r2'],
                  columns=['c0', 'c1', 'c2'])
df

    c0  c1  c2
r0   6   3   1
r1   7   9   6
r2   8   7   7


In [24]:
df.index

Index(['r0', 'r1', 'r2'], dtype='object')

In [25]:
df.columns

Index(['c0', 'c1', 'c2'], dtype='object')

In [26]:
df.values

array([[6, 3, 1],
       [7, 9, 6],
       [8, 7, 7]])

Other constructors:
- from dict of array-like or dicts:<br>
    `DataFrame.from_dict(data, orient="columns", dtype=None)`
- from sequence of (key, value) pairs:<br>
    `DataFrame.from_items(items, columns=None, orient="columns")` (deprecated since pandas 0.23 and removed in 1.0)
- from tuples, also record arrays:<br>
    `DataFrame.from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)`


and of course many I/O methods that we will discuss later on.
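
A minimal sketch of two of the alternative constructors, with made-up data:

```python
import pandas as pd

# orient='index' treats the dict keys as row labels instead of column labels.
by_rows = pd.DataFrame.from_dict({'r0': [1, 2], 'r1': [3, 4]},
                                 orient='index', columns=['a', 'b'])

# from_records builds one row per tuple.
from_tuples = pd.DataFrame.from_records([(1, 'x'), (2, 'y')],
                                        columns=['num', 'char'])
```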

Conversion of a Series to DataFrame:

- `Series.to_frame(name=None)`
- `pd.DataFrame` constructor

In [27]:
s = pd.Series(np.random.randint(0,10,size=3), index=['a', 'b', 'c'])
df = s.to_frame(name='rand_int')
df

   rand_int
a         6
b         7
c         6


In [28]:
df1 = pd.DataFrame(s, columns=['rand_int'])
df1

   rand_int
a         6
b         7
c         6


## Index

Ordered multiset (a set that allows duplicates) that contains __axis labels__ and other __metadata__ for all pandas objects. It provides the means for __quickly accessing__ the stored data.

Key properties:

- immutable both in value and size
- comes in several flavours depending on data type and structure
- provides the infrastructure for lookups, data alignment and reindexing
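
A minimal sketch of the lookup, set-like and reindexing roles of an Index. The labels are illustrative:

```python
import pandas as pd

idx_a = pd.Index(['a', 'b', 'c'])
idx_b = pd.Index(['b', 'c', 'd'])

pos = idx_a.get_loc('b')            # label -> integer position
common = idx_a.intersection(idx_b)  # labels present in both
combined = idx_a.union(idx_b)       # all labels, without repeats

# Reindexing conforms a Series to a new set of labels.
s = pd.Series([1, 2, 3], index=idx_a).reindex(idx_b)
```

After reindexing, `s` has a value for `'b'` and `'c'` and `NaN` for the new label `'d'`.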

The Index object:
    
    pd.Index(data=None, dtype=None, copy=False, 
             name=None, tupleize_cols=True)

- **data** : array-like (1-dimensional)
- **dtype** : NumPy dtype; inferred from the data when not given, falling back to object
- **copy**: copy input data, by default `False`
- **name**: name of the index
- **tupleize_cols**: attempt to create a MultiIndex if possible

Examples of creating an Index:


In [29]:
idx = pd.Index(['i1', 'i2', 'i3'])
idx

Index(['i1', 'i2', 'i3'], dtype='object')

In [30]:
idx = pd.Index([1, 2, 3], dtype=float, name='the_index')
idx

Float64Index([1.0, 2.0, 3.0], dtype='float64', name='the_index')

In [31]:
idx = pd.Index([(1, 'a'), (1, 'b'), (2, 'c')], 
               names=['lvl_1', 'lvl_2'], tupleize_cols=False)
idx

Index([(1, 'a'), (1, 'b'), (2, 'c')], dtype='object')

In [32]:
idx = pd.Index([(1, 'a'), (1, 'b'), (2, 'c')], 
               names=['lvl_1', 'lvl_2'], tupleize_cols=True)
pd.DataFrame(np.zeros(3), index=idx)

               0
lvl_1 lvl_2
1     a      0.0
      b      0.0
2     c      0.0


The labels supplied to the DataFrame constructor are internally converted into an Index object:

In [33]:
df = pd.DataFrame(np.ones((3, 3)), index=[3, 4, 5], 
                  columns=['A', 'B', 'C'])
df

     A    B    C
3  1.0  1.0  1.0
4  1.0  1.0  1.0
5  1.0  1.0  1.0


Moreover, an index can also be created out of one or more columns of a data frame:

In [34]:
df.set_index('A')

       B    C
A
1.0  1.0  1.0
1.0  1.0  1.0
1.0  1.0  1.0
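
Passing several column names gives a MultiIndex, and `reset_index` reverses the operation. A sketch with a hypothetical frame (the column names `year`, `city` and `value` are made up):

```python
import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021],
                   'city': ['A', 'B', 'A'],
                   'value': [1.0, 2.0, 3.0]})

indexed = df.set_index(['year', 'city'])  # two columns -> MultiIndex
restored = indexed.reset_index()          # move the levels back to columns
```

Both calls return a new DataFrame; `df` itself keeps its original columns.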


Index objects are immutable, which allows them to be shared safely among different data structures:

In [35]:
# NBVAL_RAISES_EXCEPTION
a = np.array([1, 2, 3, 4])
idx = pd.Index(a)
print(idx)
idx[2] = 5

Int64Index([1, 2, 3, 4], dtype='int64')


TypeError: Index does not support mutable operations
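
Although item assignment fails, methods that return a new Index can be used to derive a modified copy. A minimal sketch:

```python
import pandas as pd

idx = pd.Index([1, 2, 3, 4])

# Item assignment raises TypeError ...
try:
    idx[2] = 5
    raised = False
except TypeError:
    raised = True

# ... but delete() and insert() each return a new Index, leaving idx untouched.
changed = idx.delete(2).insert(2, 5)
```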

## Attributes and underlying data

DataFrame, Series and Index objects have a set of attributes and methods that provide information regarding:

- the data structure itself
- and underlying stored data

Most of these metadata attributes are very similar to the NumPy `ndarray` attributes.

An overview of DataFrame and Series metadata attributes/methods:

<table style="border-collapse:collapse;border-spacing:0">
<tr><th>attribute/method</th><th>returns</th><th>description</th></tr>
<tr><td><code>obj.axes</code></td><td>list</td><td>row and column Index objects</td></tr>
<tr><td><code>obj.index</code></td><td>Index</td><td>the row index</td></tr>
<tr><td><code>df.columns</code></td><td>Index</td><td>the column index</td></tr>
<tr><td><code>s.dtype</code></td><td>np.dtype</td><td>data type of the Series</td></tr>
<tr><td><code>df.dtypes</code></td><td>Series</td><td>data types of the columns</td></tr>
<tr><td><code>df.get_dtype_counts()</code></td><td>Series</td><td>data type counts</td></tr>
<tr><td><code>obj.empty</code></td><td>bool</td><td>True if NDFrame is empty</td></tr>
<tr><td><code>obj.shape</code></td><td>tuple</td><td>dimensionality of NDFrame</td></tr>
<tr><td><code>obj.size</code></td><td>int</td><td>total number of elements</td></tr>
<tr><td><code>obj.values</code></td><td>ndarray</td><td>NumPy representation of the data</td></tr>
<tr><td><code>obj.memory_usage(index, …)</code></td><td>Series</td><td>memory usage of the columns</td></tr>
<tr><td><code>df.info(verbose, …)</code></td><td>-</td><td>summary of a DataFrame</td></tr>
</table>

Note that several of the above attributes/methods are also valid for Index objects.

Some examples of accessing the metadata:

In [36]:
df = pd.DataFrame([[i+j for i in range(3)] for j in range(3)], 
                  index=[3, 4, 5], columns=['A', 'B', 'C'])

In [37]:
print(df.shape)
print(df.size)

(3, 3)
9


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 3 to 5
Data columns (total 3 columns):
A    3 non-null int64
B    3 non-null int64
C    3 non-null int64
dtypes: int64(3)
memory usage: 96.0 bytes


In [39]:
df.axes == [df.index, df.columns]

True

In [40]:
s = df['B']
print(s.size)

3


New indexes can be set by assigning to `obj.index` and `df.columns` attributes:

In [41]:
df

   A  B  C
3  0  1  2
4  1  2  3
5  2  3  4


In [42]:
df.columns = df.columns[::-1]
df.index = [0, 1, 2]
df

   C  B  A
0  0  1  2
1  1  2  3
2  2  3  4
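
When the original object should stay untouched, `rename` is one alternative to assigning new indexes: it returns a relabelled copy. A sketch with illustrative labels:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [1, 2, 3]],
                  index=[3, 4], columns=['A', 'B', 'C'])

# Map old labels to new ones; labels not in the mapping are kept as they are.
renamed = df.rename(index={3: 'first'}, columns={'A': 'alpha'})
```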


The Index has some specific attributes of its own, for example:

<table style="border-collapse:collapse;border-spacing:0">
<tr><th>attribute/method</th><th>returns</th><th>description</th></tr>
<tr><td><code>idx.has_duplicates</code></td><td>bool</td><td>True if index contains duplicate labels</td></tr>
<tr><td><code>idx.is_unique</code></td><td>bool</td><td>True if index contains only unique labels</td></tr>
<tr><td><code>idx.is_monotonic</code><br><code>idx.is_monotonic_decreasing</code><br><code>idx.is_monotonic_increasing</code></td><td>bool</td><td>True if index is monotonic in the given direction</td></tr>
<tr><td><code>idx.is_all_dates</code></td><td>bool</td><td>True if index contains only date-formatted entries</td></tr>
<tr><td><code>idx.nlevels</code></td><td>int</td><td>number of levels (1 unless the index is a MultiIndex)</td></tr>
<tr><td><code>idx.tolist()</code></td><td>list</td><td>convert index to list</td></tr>
</table>

Some examples of accessing an Index's metadata:

In [43]:
idx = pd.Index([1, 3, 4, 5])
idx

Int64Index([1, 3, 4, 5], dtype='int64')

In [44]:
print(idx.is_unique)
print(idx.is_monotonic_increasing)
print(idx.size)

True
True
4


In [45]:
idx = pd.Index([pd.Timestamp('2013'), pd.Timestamp('2012'), 
                pd.Timestamp('2011')])
idx

DatetimeIndex(['2013-01-01', '2012-01-01', '2011-01-01'], dtype='datetime64[ns]', freq=None)

In [46]:
print(idx.is_all_dates)
print(idx.is_monotonic_decreasing)
print(idx.is_monotonic_increasing)

True
True
False
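
A sketch of the remaining attributes on an index with repeated labels and on a MultiIndex. Both objects are made up for illustration:

```python
import pandas as pd

dup = pd.Index(['a', 'b', 'a'])
multi = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'c')],
                                  names=['lvl_1', 'lvl_2'])

dup_flags = (dup.has_duplicates, dup.is_unique)  # the two are opposites of each other
levels = multi.nlevels                           # one level per name
as_list = multi.tolist()                         # list of tuples
```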


## Exercises: [lab 2 - Data structures](lab_02_data_structures.ipynb)