# `pandas` playbook: basics

Mal Minhas, January 2018, updated Dec 2021

A crash run-through of core `pandas` concepts in one handy notebook.

## 1. `pandas` objects

Taken from the lesson [here](https://www.oreilly.com/learning/introducing-pandas-objects). `pandas` objects can be thought of as enhanced versions of `numpy` structured arrays in which the rows and columns are identified with labels rather than simple integer indices. There are three fundamental `pandas` objects: the **Series**, **DataFrame**, and **Index**.

In [1]:
import pandas as pd
import numpy as np
pd.__version__

'1.3.5'

#### 1.1 Series

A Series is a one-dimensional array of indexed data. The essential difference from a one-dimensional NumPy array is the presence of the index: while the NumPy array has an _implicitly_ defined integer index used to access the values, the `pandas` Series has an _explicitly_ defined index associated with the values. The index need not be an integer, but can consist of values of any desired type. By default a Series gets an integer index; if we wish, we can use strings instead:

In [2]:
series = pd.Series([0.25, 0.5, 0.75, 1.0])
series

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
series = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
series

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [4]:
series.tolist()

[0.25, 0.5, 0.75, 1.0]

Because the index maps labels to values, you can think of a `pandas` Series as a bit like a specialization of a Python dictionary.
The series-as-dict analogy becomes even clearer when a Series object is constructed directly from a Python dictionary:

In [5]:
population_dict = {'California': 38332521,'Texas': 26448193,'New York': 19651127,
                   'Florida': 19552860,'Illinois': 12882135}
population_series = pd.Series(population_dict)
population_series

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [6]:
population_series.tolist()

[38332521, 26448193, 19651127, 19552860, 12882135]
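To make the "specialization of a dictionary" idea concrete, here is a minimal sketch showing one thing a Series can do that a plain `dict` cannot: slicing by label. The variable names `population` and `subset` are illustrative only.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style lookup by key works as expected.
california = population['California']
print(california)

# Label-based slicing is not possible with a plain dict.
# Both endpoints of a label slice are included in the result.
subset = population['California':'New York']
print(subset)
```

The slice follows the order in which the entries were stored, so it returns the first three states here.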

#### 1.2 DataFrame

A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean that they share the same index.  Let's demonstrate this by creating an `area` Series to go with the `population` series above:

In [7]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area_series = pd.Series(area_dict)
area_series

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [8]:
df = pd.DataFrame({'population': population_series,'area': area_series})
print(f"type of population_series='{type(population_series)}', type of area_series='{type(area_series)}', type of df='{type(df)}'")
df

type of population_series='<class 'pandas.core.series.Series'>', type of area_series='<class 'pandas.core.series.Series'>', type of df='<class 'pandas.core.frame.DataFrame'>'


            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [9]:
df.area.tolist()

[423967, 695662, 141297, 170312, 149995]
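The following sketch illustrates what "aligned" means in practice, under the assumption that the two Series do not share exactly the same labels. The small `population` and `area` Series here are cut-down, illustrative versions of the ones above.

```python
import pandas as pd

# Two Series whose labels overlap only partially and appear in different orders.
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'California': 423967, 'Florida': 170312})

# The DataFrame constructor matches values by label, not by position.
# Labels missing from one Series are filled with NaN in that column.
aligned = pd.DataFrame({'population': population, 'area': area})
print(aligned)
```

Because Florida has no population entry, that cell is `NaN` and the `population` column is stored as floating point.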

#### 1.3 Index

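The Index object is an interesting structure in itself: it can be thought of either as an immutable array or as an ordered set (strictly a multiset, since an Index may contain repeated values). The set view supports operations such as intersection, and the result can be used to select columns: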

In [10]:
index1 = pd.Index(['area','population',3,4,5])
index2 = pd.Index(['area',8,6,7])
intersect = index1.intersection(index2)
intersect

Index(['area'], dtype='object')

In [11]:
df[intersect]

              area
California  423967
Texas       695662
New York    141297
Florida     170312
Illinois    149995
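As a short sketch of the two views of an Index described in this section, the example below uses two hypothetical integer indexes, `idx_a` and `idx_b`. It shows array-style access, the failure of in-place modification, and a few set operations.

```python
import pandas as pd

idx_a = pd.Index([1, 3, 5, 7, 9])
idx_b = pd.Index([2, 3, 5, 7, 11])

# Array view: positional access and slicing work as for a NumPy array.
second = idx_a[1]
every_other = idx_a[::2]

# Immutability: item assignment raises a TypeError.
try:
    idx_a[1] = 0
    immutable = False
except TypeError as err:
    immutable = True
    print(err)

# Set view: standard set operations return new Index objects.
common = idx_a.intersection(idx_b)
either = idx_a.union(idx_b)
only_one = idx_a.symmetric_difference(idx_b)
print(common, either, only_one)
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.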


## 2. Indexing and selection

See [here](https://www.oreilly.com/learning/data-indexing-and-selection) for more details.

#### 2.1 Series as a dictionary

In [12]:
series = pd.Series([0.25, 0.5, 0.75, 1.0],index=['a', 'b', 'c', 'd'])
print(series['b'])
print(series.b)
print(series.keys())
series

0.5
0.5
Index(['a', 'b', 'c', 'd'], dtype='object')


a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
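Beyond key lookup, a Series supports several other dictionary-style operations. The sketch below rebuilds the same `series`; the names `has_a` and `pairs` are illustrative.

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels, not the values.
has_a = 'a' in series
has_value = 0.25 in series

# items() yields (label, value) pairs, like dict.items().
pairs = list(series.items())

# Assigning to a new label extends the Series, like adding a dict key.
series['e'] = 1.25
print(series)
```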

#### 2.2 Series as 1D array

A Series builds on this dictionary-like interface and provides array-style item selection via slices, masking, and fancy indexing.
Notice that when slicing with an explicit index (e.g. `series['a':'c']`), the final index is included in the slice, while when slicing with an implicit index (e.g. `series[0:2]`), the final index is excluded from the slice.

In [13]:
# slicing by explicit index
series['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [14]:
# slicing by implicit integer index
series[0:2]

a    0.25
b    0.50
dtype: float64

In [15]:
# masking
series[(series > 0.3) & (series < 0.8)]

b    0.50
c    0.75
dtype: float64

In [16]:
# fancy indexing
series[['a', 'd']]

a    0.25
d    1.00
dtype: float64

#### 2.3 Indexers: `.loc[]`,`.iloc[]`,`.ix[]`

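These slicing and indexing conventions can be a source of confusion. If a Series has an explicit integer index, an indexing operation such as `series[1]` uses the explicit labels, while a slicing operation such as `series[1:3]` uses the implicit Python-style positions. Because of this potential confusion in the case of integer indexes, `pandas` provides some special indexer attributes which explicitly access certain indexing schemes. These are not functional methods, but attributes which expose a particular slicing interface to the data in the Series. First, the `loc` attribute allows indexing and slicing which always references the explicit index: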

In [17]:
series = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
series.loc[1]

'a'

In [18]:
series.loc[1:3]

1    a
3    b
dtype: object

The `iloc` attribute allows indexing and slicing which always references the implicit Python-style index:

In [19]:
series.iloc[1]

'b'

In [20]:
series.iloc[1:3]

3    b
5    c
dtype: object

A third indexing attribute, `ix`, was a hybrid of the two, and for Series objects was equivalent to standard `[]`-based indexing. Note that **`ix` was deprecated in `pandas` 0.20 and removed in 1.0**, so in current versions it raises an `AttributeError`. The explicit nature of `loc` and `iloc` makes them very useful in maintaining clean and readable code, especially in the case of integer indexes.

In [21]:
try:
    series.ix[1:3]
except Exception as e:
    print(e)

'Series' object has no attribute 'ix'
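Older code that relied on `ix` to mix positions and labels in a single call needs a replacement. Two possible translations are sketched below on a small illustrative DataFrame. These are suggestions rather than the only options; `by_position` and `by_label` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

# Old mixed-style call, no longer available: df.ix[:2, 'pop']

# Option 1: turn the column label into a position, then use iloc throughout.
by_position = df.iloc[:2, df.columns.get_loc('pop')]

# Option 2: turn the row positions into labels, then use loc throughout.
by_label = df.loc[df.index[:2], 'pop']

print(by_position)
print(by_label)
```

Both forms select the `pop` values for the first two rows. Which one reads better depends on whether the surrounding code is mostly positional or mostly label-based.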


Let's create a dataframe `df` using two series:

In [22]:
area_series = pd.Series({'California': 423967, 'Texas': 695662,'New York': 141297, 
                  'Florida': 170312,'Illinois': 149995})
population_series = pd.Series({'California': 38332521, 'Texas': 26448193,'New York': 19651127, 
                 'Florida': 19552860,'Illinois': 12882135})
df = pd.DataFrame({'area':area_series, 'pop':population_series})
df

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


We can use attribute-style access with column names which are strings. For example:

In [23]:
df.area is df['area']

True

Though this is a useful shorthand, keep in mind that it does not work in all cases. If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a `pop` method, so `df.pop` will point to this rather than the `pop` column:

In [24]:
df.pop is df['pop']

False
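To see what `df.pop` actually refers to, the sketch below inspects it on a small illustrative DataFrame and compares it with bracket access, which always returns the column.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# The attribute resolves to the DataFrame.pop method, not to the column.
is_method = callable(df.pop)

# Bracket access is unambiguous and returns the 'pop' column as a Series.
pop_column = df['pop']

# For a non-conflicting name, attribute and bracket access agree.
same_area = df.area.equals(df['area'])
print(is_method, pop_column.tolist(), same_area)
```

For this reason, bracket access is the safer default in reusable code, with attribute access kept for interactive exploration.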

`.loc[]` is used for accessing a group of rows and columns by **label(s)** or a boolean array.  Here's an example by label:

In [25]:
df.loc["Texas","pop"]

26448193

Here's an example using boolean arrays to pull out the same value. Because the selection is array-based rather than scalar, the result is a DataFrame:

In [26]:
df.loc[[False,True,False,False,False],[False,True]]

            pop
Texas  26448193


**Explicit** access by label is up to and including the final index, so this `.loc[]` request gets us the whole dataframe:

In [27]:
df.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


`.iloc[]` selects primarily by **integer position** (from 0 to length-1 of the axis), but may also be used with a boolean array.

In [28]:
df.iloc[1,1]

26448193
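The boolean-array form of `.iloc[]` can be sketched as follows. One assumption worth flagging: `.iloc[]` expects a plain boolean array rather than a labelled boolean Series, so the mask is converted with `to_numpy()` first. The threshold is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

# Build the mask from the data, then drop the labels to get a NumPy array.
mask = (df['pop'] > 20_000_000).to_numpy()

# iloc keeps the rows whose mask entry is True.
rows = df.iloc[mask]
print(rows)
```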

**Implicit** access is up to but excluding the final index.

In [29]:
df.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
