Lecture: AI I - Basics 

Previous:
[**Chapter 3.1: NumPy**](../03_data/01_numpy.ipynb)

---

# Chapter 3.2: Pandas

- [The Pandas Series Object](#the-pandas-series-object)
- [The Pandas DataFrame Object](#the-pandas-dataframe-object)
- [The Pandas Index Object](#the-pandas-index-object)
- [Indexing, Selection, and Assignment of Data](#indexing-selection-and-assignment-of-data)
- [Reading Series and DataFrames](#reading-series-and-dataframes)
- [Ufuncs and Aggregation](#ufuncs-and-aggregation)
- [Group By](#group-by)
- [Aggregate, Filter, Transform, Apply](#aggregate-filter-transform-apply)


Pandas is a library for efficient computation on large, labeled datasets.  
As in NumPy, many operations in Pandas are vectorized: they run over whole arrays in compiled code rather than in Python-level loops.

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame.  
DataFrames are essentially multidimensional arrays with attached row and column labels, often containing heterogeneous types and/or missing data.  
In addition to providing a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks (→ relational algebra) and spreadsheet programs.

As we have seen, NumPy’s ndarray data structure provides essential features for the kind of clean, well-organized data typically encountered in numerical computing tasks.  
While it serves this purpose very well, its limitations become apparent when we need more flexibility (e.g., attaching labels to data, handling missing values, etc.) or when we attempt operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.).  
Each of these is a key part of analyzing the less structured data that exists in many forms in the world around us.  

Pandas, and in particular its Series and DataFrame objects, build on the NumPy array structure and provide efficient access to these kinds of “data manipulation” tasks, which take up a large portion of a data scientist’s time.
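
The following is a minimal sketch of the contrast described above, using made-up temperature readings (the values, day labels, and city names are assumptions for illustration only). It shows how a missing value and labels are handled in NumPy versus Pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: four readings, one of them missing.
values = np.array([21.5, np.nan, 19.0, 23.5])

# NumPy: a plain mean propagates the missing value and returns nan.
numpy_mean = values.mean()

# Pandas: labels are attached to the values, and NaN is skipped by default.
temps = pd.Series(values, index=["Mon", "Tue", "Wed", "Thu"])
pandas_mean = temps.mean()  # mean of the three non-missing values

# Grouping by a label is a single call instead of manual masking.
df = pd.DataFrame({"city": ["A", "B", "A", "B"], "temp": values})
mean_per_city = df.groupby("city")["temp"].mean()
```

Grouping and DataFrames are only previewed here; both are covered in detail later in this chapter.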

Just as we usually import NumPy as `np`, we import Pandas under the alias `pd`.  
We also import NumPy, since we will often need it when working with Pandas:


In [1]:
import numpy as np
import pandas as pd

## The Pandas `Series` Object

A Pandas `Series` is a one-dimensional array with indexed data.  
It can be created from a list or an array as follows:


In [2]:
# Series with missing values 
data = pd.Series([0.25, 0.5, np.nan, 1.0])
data

0    0.25
1    0.50
2     NaN
3    1.00
dtype: float64

In [3]:
type(data)

pandas.core.series.Series

In [4]:
data.values, type(data.values)

(array([0.25, 0.5 ,  nan, 1.  ]), numpy.ndarray)
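
Since this Series contains a missing value, a short sketch of how Pandas treats `NaN` may be useful. It rebuilds the same `data` Series and uses the standard methods `isna`, `sum`, and `dropna`; the variable names are chosen for illustration only:

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, np.nan, 1.0])

missing_mask = data.isna()           # boolean Series, True where the value is NaN
n_missing = int(missing_mask.sum())  # number of missing entries
total = data.sum()                   # NaN is skipped by default (skipna=True)
without_nan = data.dropna()          # drops the NaN entry, keeps the remaining labels
```

Note that `dropna` keeps the original index labels of the remaining entries rather than renumbering them.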

The index is an array-like object of type `pd.Index` (here a `RangeIndex`, which is a subclass of `pd.Index`):


In [5]:
data.index, type(data.index), list(data.index)

(RangeIndex(start=0, stop=4, step=1),
 pandas.core.indexes.range.RangeIndex,
 [0, 1, 2, 3])

As with a list or a NumPy array, the data can be accessed via the associated index using the familiar Python square-bracket notation:


In [6]:
data[1:3]

1    0.5
2    NaN
dtype: float64

In [7]:
type(data[1])

numpy.float64

In [8]:
print(dir(data))

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__column_consortium_standard__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pandas_priority__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', ... (output truncated)

### Series as a Generalized NumPy Array

From what we’ve seen so far, the Series object might appear to be essentially interchangeable with a one-dimensional NumPy array.  
The key difference, however, is the presence of the index: while a NumPy array has an implicitly defined integer index used to access its values, a Pandas Series has an explicitly defined index that is linked to its values.

This explicit index definition gives the Series object additional capabilities.  
For example, the index does not have to be an integer; it can consist of values of any type.  
If we want, we can use strings as the index:


In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'd', 'c'])
data

a    0.25
b    0.50
d    0.75
c    1.00
dtype: float64

In [10]:
data.index = list("AbCD")
data

A    0.25
b    0.50
C    0.75
D    1.00
dtype: float64

In [11]:
data["b"] == data.iloc[1] == data.values[1]

np.True_

In [12]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[3, 7, 2, 4])
data

3    0.25
7    0.50
2    0.75
4    1.00
dtype: float64

In [13]:
data.index

Index([3, 7, 2, 4], dtype='int64')

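When an explicit integer index is present, it takes precedence: `data[3]` refers to the label 3, not to the fourth position. Integer slices such as `data[1:3]` are the exception, since they still select by position: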


In [14]:
data[3]

np.float64(0.25)

In [15]:
data[1:3]

7    0.50
2    0.75
dtype: float64

In [16]:
type(data[3])

numpy.float64
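
To avoid the ambiguity between labels and positions, Pandas offers the explicit accessors `.loc` (label-based) and `.iloc` (position-based). The following minimal sketch rebuilds the `data` Series from above; the result variable names are chosen for illustration only:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[3, 7, 2, 4])

by_label = data.loc[3]            # entry with label 3
by_position = data.iloc[3]        # fourth entry, whatever its label is

label_slice = data.loc[7:2]       # from label 7 through label 2, end point included
position_slice = data.iloc[1:3]   # positions 1 and 2, end point excluded
```

With this particular index both slices happen to select the same two entries, but they get there by different rules. Using `.loc` and `.iloc` makes the intent explicit.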

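Note that the labels of an explicit index do not have to be unique! Selecting a duplicated label returns all matching entries, and assigning to it overwrites all of them: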


In [17]:
d2 = pd.concat([data, data])
d2

3    0.25
7    0.50
2    0.75
4    1.00
3    0.25
7    0.50
2    0.75
4    1.00
dtype: float64

In [18]:
d2[7]

7    0.5
7    0.5
dtype: float64

In [19]:
d2[7] = 2

In [20]:
d2

3    0.25
7    2.00
2    0.75
4    1.00
3    0.25
7    2.00
2    0.75
4    1.00
dtype: float64
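
Duplicate labels are easy to create by accident, for example through concatenation. The following sketch shows one way to detect and resolve them with `Index.is_unique`, `Index.duplicated`, and `reset_index`. It rebuilds `d2` from the original `data`, and the result names are assumptions for illustration:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[3, 7, 2, 4])
d2 = pd.concat([data, data])

has_unique_labels = d2.index.is_unique    # False, every label occurs twice
duplicated_mask = d2.index.duplicated()   # True for each repeated occurrence

# Option 1: keep only the first occurrence of each label.
first_only = d2[~d2.index.duplicated(keep="first")]

# Option 2: discard the old labels and renumber from 0.
renumbered = d2.reset_index(drop=True)
```

Which option fits depends on whether the labels carry meaning. If they do, dropping duplicates keeps them, whereas renumbering throws them away.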

### Series as a Specialized Dictionary

In this sense, you can think of a Pandas Series as a specialized version of a Python dictionary.  
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, while a Series is a structure that maps typed keys to a set of typed values.  

This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient for certain operations than a Python list, the type information of a Pandas Series makes it much more efficient than a Python dictionary for specific operations.

The Series-as-dictionary analogy becomes even clearer when a Series object is constructed directly from a Python dictionary:


In [21]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [22]:
population['Texas']

np.int64(26448193)
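
The dictionary analogy extends to several familiar dictionary operations, while the typed values additionally allow vectorized arithmetic. A minimal sketch, rebuilding the same `population` Series (the result variable names are chosen for illustration only):

```python
import pandas as pd

population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135,
})

has_texas = 'Texas' in population   # membership tests the index, like dict keys
labels = list(population.keys())    # keys() returns the index
pairs = list(population.items())    # (label, value) tuples

# Typed values allow one vectorized operation instead of a Python-level loop.
in_millions = population / 1e6
```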

In contrast to a dictionary, however, a Series also supports array-like operations such as slicing:


In [23]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

Note that `'Illinois'` is included: unlike positional slices, label-based slices contain their end point.
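
The difference between the two slicing conventions can be sketched as follows, again rebuilding `population` and using the explicit accessors `.loc` and `.iloc` (the result names are chosen for illustration only):

```python
import pandas as pd

population = pd.Series({
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135,
})

# Label-based slice: the end label 'Florida' is part of the result.
by_label = population.loc['Texas':'Florida']

# Position-based slice: the end position 3 is excluded, as in NumPy and lists.
by_position = population.iloc[1:3]
```

Both slices start at `'Texas'`, but the label-based one has three entries while the position-based one has two.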


### Creating Series Objects

Data can be a scalar, which is repeated to fill the given index:


In [24]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

Data can be a dictionary, whose keys become the index (in insertion order, not sorted):


In [25]:
ser = pd.Series({2:'a', 1:'b', 3:'c'})
ser

2    a
1    b
3    c
dtype: object

In [26]:
ser.to_dict()

{2: 'a', 1: 'b', 3: 'c'}

In [27]:
list(ser.values)

['a', 'b', 'c']
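
When a dictionary is combined with an explicit `index` argument, the index selects and orders the entries. A minimal sketch under that assumption, with variable names chosen for illustration only:

```python
import pandas as pd

source = {2: 'a', 1: 'b', 3: 'c'}

# Only the requested keys are kept, in the requested order.
subset = pd.Series(source, index=[3, 2])

# A label that is not a key of the dictionary yields a missing value.
with_missing = pd.Series(source, index=[1, 4])

# The insertion order can be replaced by sorted labels afterwards.
sorted_ser = pd.Series(source).sort_index()
```

Here `subset` contains the entries for labels 3 and 2, and `with_missing` has a missing value at label 4.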

---

Exercise: [**Exercise 3.2: Pandas**](../03_data/exercises/02_pandas.ipynb)

Next: [**Chapter 3.3: Visualisation with Matplotlib**](../03_data/03_matplotlib.ipynb)