# The Series data structure

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.Series(['Tiger', 'Bear', 'Moose'])

0    Tiger
1     Bear
2    Moose
dtype: object

In [3]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [4]:
pd.Series(['Tiger', 'Bear', None])

0    Tiger
1     Bear
2     None
dtype: object

In [5]:
pd.Series([1, 2, None])

0    1.0
1    2.0
2    NaN
dtype: float64

In [6]:
np.nan == None

False

In [7]:
np.nan == np.nan

False

In [8]:
np.isnan(np.nan)

True
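Because `NaN` never compares equal to anything, including itself, an equality test cannot find missing values in a series. The sketch below contrasts that with the `isna()` method; the variable names (`numbers`, `eq_mask`, `na_mask`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A numeric Series with one missing entry; None is stored as NaN.
numbers = pd.Series([1, 2, None])

# Equality never matches NaN, so this mask is False everywhere.
eq_mask = numbers == np.nan

# isna() is the element-wise test that does detect missing values.
na_mask = numbers.isna()

print(eq_mask.tolist())  # [False, False, False]
print(na_mask.tolist())  # [False, False, True]
```

`isna()` also flags `None` in an object series such as `pd.Series(['Tiger', 'Bear', None])`.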

In [9]:
s = pd.Series({'Archery': 'Bhutan',
               'Golf': 'Scotland',
               'Sumo': 'Japan',
               'Taekwondo': 'South Korea'})
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [10]:
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

You can separate index creation from the data by explicitly passing the index as a list to the series:

In [11]:
pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])

India      Tiger
America     Bear
Canada     Moose
dtype: object

If the values in the index list are not aligned with the keys of the dictionary used to create the series, Pandas overrides the automatic index creation and keeps only the index values that you provided.

Pandas ignores every key/value pair in the dictionary whose key is not in the provided index. For any index value that is not a key in the dictionary, Pandas adds a `None` or `NaN` value.

For example, we pass in a dictionary of four items, but only two are preserved in the series object because of the index list. `Hockey` has also been added, but since it appears only in the index list, it has no value associated with it.

In [12]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
idx = ['Golf', 'Sumo', 'Hockey']
pd.Series(sports, index=idx)

Golf      Scotland
Sumo         Japan
Hockey         NaN
dtype: object
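The alignment described above can also be spelled as two steps: build the series from the whole dictionary, then call `reindex` with the desired labels. This is a minimal sketch to make the behaviour explicit, not a claim about how the constructor is implemented; `from_constructor` and `from_reindex` are illustrative names.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
idx = ['Golf', 'Sumo', 'Hockey']

# One step: the index list selects which dictionary keys survive.
from_constructor = pd.Series(sports, index=idx)

# Two steps: keep every key first, then conform to the requested labels.
from_reindex = pd.Series(sports).reindex(idx)

# Both routes drop 'Archery' and 'Taekwondo' and leave 'Hockey' missing.
print(from_constructor.equals(from_reindex))  # True
print(from_constructor.isna().tolist())       # [False, False, True]
```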

## Querying a Series

In [13]:
s = pd.Series({'Archery': 'Bhutan',
               'Golf': 'Scotland',
               'Sumo': 'Japan',
               'Taekwondo': 'South Korea'})
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

A series can be queried either by index position, using the `iloc` attribute, or by index label, using the `loc` attribute:

In [14]:
s.iloc[2]

'Japan'

In [15]:
s.loc['Sumo']

'Japan'

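Pandas also provides a kind of smart syntax: you can use the indexing operator directly on the series, and Pandas infers whether you mean a position or a label. Note that recent Pandas releases deprecate the positional fallback used by `s[2]` on a non-integer index, so `iloc` is the more durable spelling: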

In [16]:
s[2]

'Japan'

In [17]:
s['Sumo']

'Japan'

If the index is a list of integers, Pandas can't determine automatically whether you intend to query by index position or by index label. The safer option is to be explicit and use the `iloc` or `loc` attributes directly.

For example, suppose the countries are indexed by integers. Calling `s[2]` raises a `KeyError`, because no label in the index has that value:

In [18]:
s = pd.Series({99: 'Bhutan',
              100: 'Scotland',
              101: 'Japan',
              102: 'South Korea'})

# s[2] does not fall back to s.iloc[2] as one might expect; it raises a KeyError instead

Instead we have to call `iloc` explicitly if we want the third item:

In [19]:
s.iloc[2]

'Japan'
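To see the failure without interrupting a notebook run, the lookup can be wrapped in `try`/`except`. The sketch below rebuilds the integer-indexed series and compares the three spellings; `lookup_failed`, `by_position`, and `by_label` are illustrative names.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan',
               100: 'Scotland',
               101: 'Japan',
               102: 'South Korea'})

# With an integer index, s[2] is treated as a label lookup, and 2 is not a label.
try:
    s[2]
    lookup_failed = False
except KeyError:
    lookup_failed = True

by_position = s.iloc[2]   # third item, counted from zero
by_label = s.loc[101]     # item whose label is 101

print(lookup_failed)          # True
print(by_position, by_label)  # Japan Japan
```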

## Operations

In [20]:
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

Slow example:

In [21]:
total = 0
for item in s:
    total += item
total

324.0

A faster approach uses vectorization:

In [22]:
total = np.sum(s)
total

324.0
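The same vectorized reduction is also available as a method on the series itself. A small sketch, assuming the four-value series from above; `total_np` and `total_method` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([100.00, 120.00, 101.00, 3.00])

# NumPy function applied to the Series.
total_np = np.sum(s)

# Equivalent Series method; no explicit Python loop in either case.
total_method = s.sum()

print(total_np, total_method)  # 324.0 324.0
```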

Benchmark both implementations:

In [23]:
s = pd.Series(np.random.randint(0, 1000, 10000))
s.head()

0    657
1    360
2    153
3     62
4    701
dtype: int64

In [24]:
len(s)

10000

In [25]:
%%timeit -n 100
summary = 0
for item in s:
    summary += item

1.13 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [26]:
%%timeit -n 100
summary = np.sum(s)

163 µs ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


A related feature in Pandas and NumPy is called broadcasting. With broadcasting, you can apply an operation to every value in the series at once, here changing the series in place.

Add 2 to each item in `s` using broadcasting:


In [27]:
s += 2
s.head()

0    659
1    362
2    155
3     64
4    703
dtype: int64

The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly:

In [28]:
# items() and .at replace iteritems() and set_value(), which later Pandas releases removed
for label, value in s.items():
    s.at[label] = value + 2
s.head()

0    661
1    364
2    157
3     66
4    705
dtype: int64

Benchmark the two approaches:

In [29]:
%%timeit -n 10
s1 = pd.Series(np.random.randint(0,1000,10000))
for label, value in s1.items():
    s1.at[label] = value + 2

39.2 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [30]:
%%timeit -n 10
s2 = pd.Series(np.random.randint(0,1000,10000))
s2 += 2

402 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
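Broadcasting is not limited to in-place addition. The sketch below, using a small illustrative series rather than the random one from the benchmark, shows that an arithmetic or comparison operator with a scalar returns a new series and leaves the original untouched; `shifted`, `doubled`, and `is_large` are illustrative names.

```python
import pandas as pd

s = pd.Series([100, 120, 101, 3])

shifted = s + 2     # new Series; s itself is not modified
doubled = s * 2     # the scalar is applied to every element
is_large = s > 100  # comparisons broadcast too, giving a boolean Series

print(shifted.tolist())   # [102, 122, 103, 5]
print(doubled.tolist())   # [200, 240, 202, 6]
print(is_large.tolist())  # [False, True, True, False]
print(s.tolist())         # [100, 120, 101, 3]
```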


Mixed types for data values or index labels are no problem for Pandas; the series falls back to the `object` dtype:

In [31]:
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bear'
s

0            1
1            2
2            3
Animal    Bear
dtype: object

Index values do not have to be unique, which makes a series conceptually different from a table in a relational database, where a key usually identifies a single row.

When concatenating, Pandas takes the series and tries to infer the best data type for the result. `pd.concat` doesn't change the underlying series. It instead returns a new series made up of the inputs joined together. (Older Pandas releases offered `Series.append` for this; it was removed in pandas 2.0.)

In [32]:
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'],
                                     index=['Cricket', 'Cricket', 'Cricket', 'Cricket'])
all_countries = pd.concat([original_sports, cricket_loving_countries])

In [33]:
all_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

The original series have not changed:

In [34]:
original_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [35]:
cricket_loving_countries

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

When we query the combined series by a duplicated label, we don't get a single value but a series. This is similar to a relational database, where every query against a table returns a result set that is itself a table:

In [36]:
all_countries.loc['Cricket']

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
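Whether a label lookup returns a scalar or a series depends on how many rows carry that label. The following sketch rebuilds the combined series with `pd.concat` and checks the index with `is_unique`; `unique_hit` and `duplicate_hit` are illustrative names.

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados', 'Pakistan', 'England'],
                                     index=['Cricket'] * 4)
all_countries = pd.concat([original_sports, cricket_loving_countries])

# The index reports whether any label is repeated.
print(all_countries.index.is_unique)  # False

# A label that occurs once yields a scalar.
unique_hit = all_countries.loc['Golf']

# A label that occurs several times yields a Series of all matches.
duplicate_hit = all_countries.loc['Cricket']

print(unique_hit)          # Scotland
print(len(duplicate_hit))  # 4
```

Checking `index.is_unique` before a lookup is one way to avoid being surprised by the return type.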