# The Series Data Structure

In [2]:
import pandas as pd

A **series** is the basic data structure of the Pandas library.

If we create a series from an existing list, the result is a sort of table in which each list value is paired with a corresponding **index** value. By default, the index values are integers starting at 0.

In [6]:
# List to turn into a series
dwight = ['Bears', 'Beets', 'Battlestar Galactica']
pd.Series(dwight)

0                   Bears
1                   Beets
2    Battlestar Galactica
dtype: object

Alternatively, we can set the index explicitly by passing a second list to the `index` parameter of `pd.Series`.

In [8]:
# List to turn into the index of a series
categories = ['animal', 'plant', 'show']
dwight_favs = pd.Series(dwight, index=categories)
dwight_favs

animal                   Bears
plant                    Beets
show      Battlestar Galactica
dtype: object
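A dictionary offers another route to the same kind of series: the keys become the index labels and the values become the data. The sketch below is a minimal illustration; the names `favs` and `dwight_favs_from_dict` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Keys become the index labels, values become the series data
favs = {'animal': 'Bears',
        'plant': 'Beets',
        'show': 'Battlestar Galactica'}
dwight_favs_from_dict = pd.Series(favs)

# The labels follow the dictionary's insertion order
print(list(dwight_favs_from_dict.index))
print(dwight_favs_from_dict.loc['plant'])
```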

Use the `index` attribute to access the index of a series.

In [9]:
dwight_favs.index

Index(['animal', 'plant', 'show'], dtype='object')

# Querying a Series

There are two ways to access the values in a series. The first is by **index position**, the integer position (starting at 0) that every entry is assigned automatically.

To access by index position, we use the `iloc` attribute, passing in the position of the value we want (like accessing a list). Note that we use _square brackets_, not parentheses, because `iloc` is an attribute, not a method.

In [12]:
# Access using .iloc attribute
dwight_favs.iloc[2]

'Battlestar Galactica'

Note that _index position_ is different from the more general "index" defined in the last section. The series `dwight_favs` has an explicit index (as we saw above), but it still has hidden index positions for each entry.

The second way to access a series is by **index label**, in which case we use the `loc` attribute, passing in the label of the value we want (like accessing dictionary values using the corresponding keys). Again, note that we use square brackets.

In [13]:
# Access using .loc attribute
dwight_favs.loc['animal']

'Bears'

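Pandas also lets you use square bracket notation directly on the series itself. If the index labels are not integers, an integer key is treated as an index position and any other key is treated as a label. Be aware that this integer fallback is deprecated in recent pandas 2.x releases (integer keys will eventually always be treated as labels), so prefer `iloc` for positions in new code.

In [14]:
# Labels are strings, so the int is treated as a position (deprecated; prefer .iloc[1])
dwight_favs[1]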

'Beets'

In [15]:
# Index labels are strings, so this accesses by loc
dwight_favs['animal']

'Bears'

Be careful, though, if your index labels are integers. In that case, square brackets always treat an integer as an index *label*, never as a position, so asking for a position that isn't also a label raises a `KeyError`. It's safer to use the `iloc` and `loc` attributes explicitly.

In [17]:
info = {99: 'hello',
        100: 'goodbye',
        101: 'greetings',
        102: 'farewell'}

s = pd.Series(info)
s

99         hello
100      goodbye
101    greetings
102     farewell
dtype: object

In [18]:
# KeyError: with integer labels, 0 is looked up as a label, and no label 0 exists
s[0]

KeyError: 0

In [19]:
# Specifies that we're asking for the index *position* 0
s.iloc[0]

'hello'

In [20]:
# Specifies that we're asking for the index *label* 100
s.loc[100]

'goodbye'
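To make the integer-label behavior concrete, here is a small sketch that rebuilds the series above and compares the three kinds of lookup. The variable names `by_label`, `lookup_failed`, and `by_position` are introduced only for illustration.

```python
import pandas as pd

s = pd.Series({99: 'hello', 100: 'goodbye', 101: 'greetings', 102: 'farewell'})

# With an integer index, plain brackets look up *labels*
by_label = s[99]

# 0 is not one of the labels, so plain brackets raise KeyError
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# iloc always means position, whatever the index looks like
by_position = s.iloc[0]

print(by_label, lookup_failed, by_position)  # hello True hello
```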

# Working with the Data

Let's actually do something with some data. A common task is to do some operation across all values in a series. This may involve trying to find a specific number, summarizing the data, or transforming the data in some way.

For example, let's say we want to find the sum of all the values in the following series.

In [22]:
s = pd.Series([100.0, 120.0, 101.0, 3.0])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

A typical approach would be to iterate over each item and add them one by one.

In [23]:
total = 0
for num in s:
    total += num
print(total)

324.0


This works, but it's slow, because every addition runs as a separate step in the Python interpreter. Pandas has a built-in `sum` method that performs the whole operation in optimized, compiled code. This technique is called _vectorization_, and it makes such tasks much faster.

In [28]:
# Vectorized sum() function
total = s.sum()
total

324.0
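One rough way to see the difference is to time both approaches on a larger series. This is only a sketch: the series `big`, the helper `loop_sum`, and the repeat count are assumptions made for the example, and the timings will vary from machine to machine.

```python
import timeit
import pandas as pd

# A larger, hypothetical series so the difference is measurable
big = pd.Series(range(100_000), dtype='float64')

def loop_sum(series):
    """Add the values one at a time in a Python loop."""
    total = 0.0
    for num in series:
        total += num
    return total

# Time five runs of each approach
loop_time = timeit.timeit(lambda: loop_sum(big), number=5)
vector_time = timeit.timeit(lambda: big.sum(), number=5)

print(f"loop: {loop_time:.4f}s, vectorized: {vector_time:.4f}s")
```

Both approaches return the same total; only the time taken differs.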

# Broadcasting

**Broadcasting** is the idea of applying a single operation to every value in a series. Pandas does this automatically all the time, so it's important to understand where and how this works.

In [30]:
s = pd.Series([41, 291, 27, 236, 210])
s

0     41
1    291
2     27
3    236
4    210
dtype: int64

In [33]:
# Multiply every value in the series by 2
s *= 2
s

0     82
1    582
2     54
3    472
4    420
dtype: int64

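All of the typical mathematical operations are vectorized, meaning they broadcast automatically. If you're interested in applying a function of your own element-wise, check out the [NumPy documentation](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) for `numpy.vectorize`. Note that `numpy.vectorize` is provided mainly for convenience and is implemented as a loop, so it does not bring the same speedup as the built-in operations.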
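As a brief sketch of broadcasting with other operators, the example below starts again from the original (undoubled) values. The result names `plus_two`, `is_large`, and `halved` are made up for this illustration.

```python
import pandas as pd

s = pd.Series([41, 291, 27, 236, 210])

plus_two = s + 2    # add 2 to every value
is_large = s > 100  # compare every value to 100, giving booleans
halved = s / 2      # true division, so the result holds floats

print(plus_two.tolist())  # [43, 293, 29, 238, 212]
print(is_large.tolist())  # [False, True, False, True, True]
print(halved.tolist())    # [20.5, 145.5, 13.5, 118.0, 105.0]

# These operators return new series; s itself is unchanged
print(s.tolist())         # [41, 291, 27, 236, 210]
```

Unlike the in-place `s *= 2` above, the plain operators leave the original series alone.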

In [53]:
terms = ['eigenvalue',
         'eigenvector',
         'linear space',
         'subspace',
         'vector space']

labels = ['term1', 'term2', 'term3', 'term3', 'term5']

lin_alg = pd.Series(terms, index=labels)
lin_alg

term1      eigenvalue
term2     eigenvector
term3    linear space
term3        subspace
term5    vector space
dtype: object

In [46]:
lin_alg.iloc[1:3]

term2     eigenvector
term3    linear space
dtype: object

In [54]:
lin_alg.loc['term1':'term3']

term1      eigenvalue
term2     eigenvector
term3    linear space
term3        subspace
dtype: object
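The two slices above behave differently at the endpoint: `iloc` slices exclude the stop position, as list slices do, while `loc` slices include the stop label (here, both entries labelled `term3`). The sketch below rebuilds the series and counts the results; `by_position` and `by_label` are names introduced for the example.

```python
import pandas as pd

lin_alg = pd.Series(
    ['eigenvalue', 'eigenvector', 'linear space', 'subspace', 'vector space'],
    index=['term1', 'term2', 'term3', 'term3', 'term5'])

# Positions 1 and 2 only; position 3 is excluded
by_position = lin_alg.iloc[1:3]

# Everything from 'term1' through the last 'term3', endpoint included
by_label = lin_alg.loc['term1':'term3']

print(len(by_position))  # 2
print(len(by_label))     # 4
```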

# Other

In addition to accessing existing values, the `loc` attribute can be used to add new ones. If the label passed into `loc` does not exist, then a new entry is added to the series.

In [89]:
s = pd.Series([1, 2, 3])
# Add new label/value pair
s.loc['bestshowever'] = 'avatarthelastairbender'
s

0                                    1
1                                    2
2                                    3
bestshowever    avatarthelastairbender
dtype: object
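Assigning through `loc` overwrites the value when the label already exists and adds an entry only when it does not. Here is a minimal sketch of both cases; the `scores` series and its labels are hypothetical.

```python
import pandas as pd

scores = pd.Series({'alice': 90, 'bob': 85})  # hypothetical data

scores.loc['bob'] = 88    # existing label: the value is overwritten
scores.loc['carol'] = 75  # new label: a new entry is added

print(list(scores.index))  # ['alice', 'bob', 'carol']
print(len(scores))         # 3
```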

Note that a series does not require that all index labels are unique. Let's look at an example where this is the case.

In [90]:
# Here's a series where the index labels are all unique
national_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})

# Here's a series where all the index labels are the same
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                     index=['Cricket',
                                            'Cricket',
                                            'Cricket',
                                            'Cricket'])

print(national_sports)
print() # extra space for clarity
print(cricket_loving_countries)

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object


We can combine `cricket_loving_countries` with `national_sports` using `pd.concat`, and save the result into a new series called `all_countries`. (Older tutorials use the `Series.append` method for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0.)

In [92]:
all_countries = pd.concat([national_sports, cricket_loving_countries])
all_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

Note that combining series this way works differently from appending to a list.

When you append to a plain list, the list itself is changed as new values are added to it. In Pandas, `pd.concat` creates and returns a *new* series containing the values of every series passed in, and leaves the originals untouched. It's common for new Pandas users to get this confused, so watch out for it!

We can verify that `national_sports` remains unchanged by printing it in a cell.

In [93]:
national_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object
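The contrast with list appending can be sketched side by side. This example uses `pd.concat`, since `Series.append` is no longer available in pandas 2.0 and later; the variables `sports_list`, `first`, `second`, and `combined` are illustrative names only.

```python
import pandas as pd

# list.append changes the list in place and returns None
sports_list = ['Archery', 'Golf']
result = sports_list.append('Sumo')
print(result)       # None
print(sports_list)  # ['Archery', 'Golf', 'Sumo']

# pd.concat returns a new series and leaves its inputs alone
first = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
second = pd.Series({'Sumo': 'Japan'})
combined = pd.concat([first, second])
print(len(first), len(combined))  # 2 3
```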

Here's what the combined series looks like.

In [94]:
all_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

Note that if we query the series for the countries that have Cricket as their national sport, we get back an entire series, not just a single value, because the label `Cricket` appears more than once.

In [95]:
# Query for all countries with Cricket as their national sport
all_countries.loc['Cricket']

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
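Because the return type of `loc` depends on whether a label is repeated, it can help to check the index first. The sketch below uses a shortened, hypothetical version of `all_countries`; passing a list of labels to `loc` is one way to get a series back in every case.

```python
import pandas as pd

all_countries = pd.Series(
    ['Bhutan', 'Scotland', 'Australia', 'Barbados'],
    index=['Archery', 'Golf', 'Cricket', 'Cricket'])

# False here, because 'Cricket' appears twice
print(all_countries.index.is_unique)

one = all_countries.loc['Golf']            # unique label: a single value
many = all_countries.loc['Cricket']        # repeated label: a series
always_series = all_countries.loc[['Golf']]  # list of labels: always a series

print(type(one).__name__, type(many).__name__, type(always_series).__name__)
```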