In [49]:
import pandas as pd
import numpy as np
import pyarrow as pa

# Introduction
A pandas Series is one of the two core pandas objects (the other being the DataFrame). It is a one-dimensional, array-like object used to model a single column or row of data, and it is the building block of a DataFrame.

## Simple Object
A pure-Python model of what a pandas Series really is:

In [2]:
# Pure Python
series = {
    "index": [0, 1, 2, 3],
    "data": [145, 142, 38, 13],
    "name": "songs"
}

In [3]:
# Short function to pull items from the data structure
def get(series, idx):
    value_idx = series["index"].index(idx)
    return series["data"][value_idx]

In [4]:
get(series, 1)

142

In [5]:
songs = {
    "index": ['Paul', 'John', 'George', 'Ringo'],
    "data": [145, 142, 38, 13],
    "name": "counts"
}

In [6]:
get(songs, "John")

142

The previous example shows what a pandas Series does: it pairs each value with an index label, which makes it possible to support other index types, such as strings or dates.
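As a minimal sketch of the date case, the same dictionary model works unchanged when the labels are dates. The `plays` data below is hypothetical, and `get` is repeated only so the snippet runs on its own:

```python
from datetime import date

# Same lookup helper as above, repeated so the sketch is self-contained
def get(series, idx):
    value_idx = series["index"].index(idx)
    return series["data"][value_idx]

# Hypothetical data: the labels are dates instead of integers or strings
plays = {
    "index": [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
    "data": [10, 25, 7],
    "name": "plays",
}

result = get(plays, date(2024, 1, 2))
print(result)
```

The lookup only relies on the labels supporting equality comparison, which is why any such type can serve as an index in this toy model.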

# The pandas Series

In [7]:
# Default NumPy backend
songs2 = pd.Series([145, 142, 38, 13], name="counts")

In [8]:
songs2

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

In [9]:
songs3 = pd.Series([145, 142, 38, 13], name="counts", dtype="int64[pyarrow]")

In [10]:
songs3

0    145
1    142
2     38
3     13
Name: counts, dtype: int64[pyarrow]

The pyarrow backend is generally more efficient in both computation and memory usage. It also supports missing values for every type and provides a native string type.

In [11]:
songs2.index

RangeIndex(start=0, stop=4, step=1)

The index can be string-based as well, in which case pandas indicates
that the datatype for the index is object (not string):

In [12]:
songs4 = pd.Series(
    [145, 142, 38, 13],
    name="counts",
    dtype="int64[pyarrow]",
    index=['Paul', 'John', 'George', 'Ringo']
)

In [13]:
songs4.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')

The data in a Series does not need to be numeric or homogeneous:

In [14]:
# Inserting an object
class Foo:
    pass

ringo = pd.Series(
    ['Richard', 'Starkey', 13, Foo()],
    name="ringo"
)

In [15]:
ringo

0                                    Richard
1                                    Starkey
2                                         13
3    <__main__.Foo object at 0x7f63748cc2b0>
Name: ringo, dtype: object

# The NA value
Depending on the backend, missing values appear as NA (pyarrow) or NaN (NumPy). However, NumPy can only represent missing numeric values in floats, so an integer Series with a missing value is coerced to float.

In [16]:
# This will coerce the type to float
nan_series = pd.Series(
    [2, np.nan],
    index=['Ono', 'Clapton']
)
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

In [17]:
# This will still be of type int!
nan_series2 = pd.Series(
    [2, np.nan],
    index=['Ono', 'Clapton'],
    dtype='int64[pyarrow]'
)
nan_series2

Ono           2
Clapton    <NA>
dtype: int64[pyarrow]

In [18]:
# pandas skips NA values in aggregations such as count()
nan_series2.count()

np.int64(1)

In [19]:
# But it includes it in the total number of entries
nan_series2.size

2
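The relationship between `size`, `count()`, and the number of missing entries can be checked directly. This is a small sketch on a NumPy-backed Series; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# NumPy-backed Series with one missing value (coerced to float64)
ser = pd.Series([2, np.nan], index=["Ono", "Clapton"])

n_total = ser.size                  # every entry, missing or not
n_present = int(ser.count())        # non-missing entries only
n_missing = int(ser.isna().sum())   # missing entries only

print(n_total, n_present, n_missing)
```

Other aggregations such as `sum()` and `mean()` also skip missing values by default, so `ser.mean()` here is computed from the single present value.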

# Similar to NumPy

In [22]:
numpy_ser = np.array([145, 142, 38, 13])

In [38]:
# Comparing with the pyarrow Series that has a string index (songs4).
# .iloc makes the positional lookup explicit; indexing by label would also work.
print("Series:", songs4.iloc[1])
print("NumPy:", numpy_ser[1])

Series: 142
NumPy: 142
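To make the difference between positional and label-based access concrete, here is a short sketch that rebuilds the data with the default backend (the names `counts` and `arr` are illustrative):

```python
import numpy as np
import pandas as pd

counts = pd.Series(
    [145, 142, 38, 13],
    name="counts",
    index=["Paul", "John", "George", "Ringo"],
)
arr = np.array([145, 142, 38, 13])

by_position = counts.iloc[1]    # second element, regardless of the labels
by_label = counts.loc["John"]   # element whose index label is "John"
from_numpy = arr[1]             # plain NumPy arrays are accessed by position

print(by_position, by_label, from_numpy)
```

Using `.iloc` and `.loc` explicitly avoids ambiguity about whether a key is meant as a position or as a label.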


In [39]:
# There are some similar methods, like mean
print("Series mean:", songs4.mean())
print("NumPy mean:", numpy_ser.mean())

Series mean: 84.5
NumPy mean: 84.5


We can check how many attributes and methods the two types have in common:

In [40]:
print("Number of common methods:", len(set(dir(numpy_ser)) & set(dir(songs4))))

Number of common methods: 111


Here is the full list:

In [41]:
set(dir(numpy_ser)) & set(dir(songs4))

{'T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__si

We can also use boolean arrays as masks to filter items:

In [42]:
mask = songs4 > songs4.median()

In [43]:
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool[pyarrow]

In [44]:
# Using the mask as a filter:
songs4[mask]

Paul    145
John    142
Name: counts, dtype: int64[pyarrow]

NumPy also supports filtering with boolean arrays. Its arrays have no .median method, but NumPy provides an np.median function:

In [45]:
mask = numpy_ser > np.median(numpy_ser)
numpy_ser[mask]

array([145, 142])
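Masks can also be combined before filtering. The following sketch, using the default backend and illustrative variable names, combines conditions with `&`, `|`, and `~`:

```python
import pandas as pd

counts = pd.Series(
    [145, 142, 38, 13],
    name="counts",
    index=["Paul", "John", "George", "Ringo"],
)

# Parentheses are needed because & and | bind tighter than comparisons
middle = counts[(counts > 13) & (counts < 145)]
extremes = counts[(counts == counts.max()) | (counts == counts.min())]
not_high = counts[~(counts > counts.median())]

print(middle.index.tolist())
print(extremes.index.tolist())
print(not_high.index.tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the bitwise operators are used instead.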

# Categorical Data

In [47]:
s = pd.Series(["s", "m", "l"], dtype= "category")

In [48]:
s

0    s
1    m
2    l
dtype: category
Categories (3, object): ['l', 'm', 's']

In [63]:
# This Series represents sizes, which have a natural order, but categories are unordered by default
s.cat.ordered

False

We can define a categorical type that specifies the order:

In [64]:
s2 = pd.Series(["m", "l", "xs", "s", "xl"], dtype="string[pyarrow]")
size_type = pd.CategoricalDtype(
    categories=["s", "m", "l"],
    ordered=True
)

In [65]:
s3 = s2.astype(size_type)

In [66]:
s3

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

In the previous example, the values not covered by size_type ("xs" and "xl") are converted to NaN when casting with .astype.
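Because this conversion happens silently, it can help to check which values would be lost before casting. This is a minimal sketch using the default string storage and illustrative variable names:

```python
import pandas as pd

sizes = pd.Series(["m", "l", "xs", "s", "xl"])
size_type = pd.CategoricalDtype(categories=["s", "m", "l"], ordered=True)

# Values that are not among the declared categories
uncovered = sizes[~sizes.isin(size_type.categories)]
print(uncovered.tolist())

# After casting, those same positions hold missing values
converted = sizes.astype(size_type)
n_lost = int(converted.isna().sum())
print(n_lost)
```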

In [67]:
# Given the ordering, only "m" and "l" should be True
s3 > "s"

0     True
1     True
2    False
3    False
4    False
dtype: bool
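An ordered categorical also affects sorting and the `min`/`max` aggregations, which follow the declared category order rather than alphabetical order. A small sketch with illustrative data:

```python
import pandas as pd

size_type = pd.CategoricalDtype(categories=["s", "m", "l"], ordered=True)
sizes = pd.Series(["m", "l", "s", "m"], dtype=size_type)

# Sorting follows the category order s < m < l
sorted_sizes = sizes.sort_values().tolist()

# min and max are only defined because the categories are ordered
smallest, largest = sizes.min(), sizes.max()

# Comparisons against a category label use the same order
at_least_m = (sizes >= "m").tolist()

print(sorted_sizes, smallest, largest, at_least_m)
```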

We can also add ordering information to existing categorical data (instead of creating it from scratch). However, the new list must contain exactly the existing categories; otherwise we get a ValueError, as shown below:

In [68]:
s = pd.Series(["s","m","l"], dtype= "category")
s.cat.reorder_categories(["xs","s","m","l","xl"], ordered=True)

ValueError: items in new_categories are not the same as in old categories

To avoid this, we first need to add the missing categories to the Series:

In [69]:
(
    s
    .cat.add_categories(["xs","xl"])
    .cat.reorder_categories(["xs","s","m","l","xl"], ordered=True)
)

0    s
1    m
2    l
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
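As an alternative sketch, `.cat.set_categories` can add the missing categories and set the order in a single call:

```python
import pandas as pd

s = pd.Series(["s", "m", "l"], dtype="category")

# Replace the category list and mark it as ordered in one step
resized = s.cat.set_categories(["xs", "s", "m", "l", "xl"], ordered=True)

print(resized.cat.categories.tolist())
print(resized.cat.ordered)
```

One caveat: `set_categories` converts any existing value that is absent from the new list to NaN without raising an error, so the two-step version is the safer choice when silent data loss is a concern.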

In [77]:
# As a side note, the string and datetime accessors still work on categorical Series, so common operations remain available.
# However, the result has the object type and the categorical information is lost:
s3.str.upper()

0      M
1      L
2    NaN
3      S
4    NaN
dtype: object

In [79]:
# Hence, we would need to rearrange categories:
(
    s3
    .str.upper()
    .astype("category")
    .cat.reorder_categories(["S","M","L"], ordered=True)
)

0      M
1      L
2    NaN
3      S
4    NaN
dtype: category
Categories (3, object): ['S' < 'M' < 'L']
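When the goal is only to relabel the categories, `.cat.rename_categories` accepts a callable and keeps both the categorical type and the ordering. A sketch that rebuilds the data with the default string storage (variable names are illustrative):

```python
import pandas as pd

size_type = pd.CategoricalDtype(categories=["s", "m", "l"], ordered=True)
sizes = pd.Series(["m", "l", "xs", "s", "xl"]).astype(size_type)

# Apply str.upper to each category label; values and order are preserved
upper = sizes.cat.rename_categories(str.upper)

print(upper.cat.categories.tolist())
print(upper.cat.ordered)
```

This avoids the round trip through the object type, since the operation touches only the category labels and not the individual values.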

### Note on pyarrow

PyArrow also has a categorical type called dictionary, but there is no convenient way to create one, so the pandas "category" type is recommended.

In [50]:
# Example of PyArrow's dictionary type
dict_type = pd.ArrowDtype(pa.dictionary(pa.int64(), pa.utf8()))

In [53]:
size = pd.Series(["m","l","xs","s","xl"], dtype=dict_type)

In [54]:
size

0     m
1     l
2    xs
3     s
4    xl
dtype: dictionary<values=string, indices=int64, ordered=0>[pyarrow]

In [58]:
# If a categorical column is saved to a Feather file and read back with the pyarrow backend, it will have the dictionary type:
(
    pd.Series(["s","m","l"], dtype="category")
    .rename("size")
    .to_frame()
    .to_feather("/tmp/cat.ft")
)

In [62]:
(
    pd.read_feather("/tmp/cat.ft", dtype_backend="pyarrow")
    .loc[:, "size"]
    .dtype
)

dictionary<values=string, indices=int8, ordered=0>[pyarrow]

# Exercises

In [83]:
# 1. Create a series with the temperature values for the last
# seven days. Filter out the values below the mean.
temp = pd.Series([30, 34, 28, 25, 30, 32, 29])
print(temp[temp > temp.mean()])
print("mean:", temp.mean())

0    30
1    34
4    30
5    32
dtype: int64
mean: 29.714285714285715


In [None]:
# 2. Create a series with your favorite colors. Use a
# categorical type.

In [84]:
colours = pd.Series(["Red", "Blue", "Black"], dtype="category")
colours

0      Red
1     Blue
2    Black
dtype: category
Categories (3, object): ['Black', 'Blue', 'Red']

# Extra
The Zen of Python!

In [46]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
