# Introduction to Pandas

Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools. The core of pandas is its *dataframe*, which is essentially a table of data. Pandas provides easy and powerful ways to import data from a variety of sources and export it to just as many. It is also explicitly designed to handle *missing data* elegantly, which is a very common problem in real-world data.

The official [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) is very comprehensive and will answer a lot of your questions. However, it can sometimes be hard to find the right page, so don't be afraid to use Google to find help.

## Series

The simplest of pandas' data structures is the `Series`. It is a one-dimensional list-like structure. Let's start by importing it from the `pandas` module:

In [1]:
from pandas import Series

We can create a `Series` from a `list`:

In [2]:
Series([14, 7, 3, -7, 8])

0    14
1     7
2     3
3    -7
4     8
dtype: int64

There are three main components to this output.
The first column (`0`, `1`, etc.) is the index. By default, this numbers each row starting from zero.
The second column is our data, stored in the same order we entered it in our list.
Finally, at the bottom is the `dtype`, which stands for 'data type'. It tells us that all our data is being stored as 64-bit integers.
Usually you can ignore the `dtype` until you start doing more advanced things.
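
As a rough illustration of these three components, each one can also be inspected directly. This is only a sketch: the variable names `numbers` and `decimals` are made up for this example, and it assumes the standard `index`, `tolist()` and `dtype` members of `Series`.

```python
from pandas import Series

numbers = Series([14, 7, 3, -7, 8])

# The index labels, which count up from zero by default
print(list(numbers.index))  # [0, 1, 2, 3, 4]

# The data itself, in the order it was entered
print(numbers.tolist())     # [14, 7, 3, -7, 8]

# The data type shared by every element
print(numbers.dtype)

# The dtype follows the data: one decimal value makes the whole Series floating point
decimals = Series([1.5, 2, 3])
print(decimals.dtype)       # float64
```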

In the first example above we allowed pandas to automatically create an index for our `Series` (this is the `0`, `1`, `2`, etc. in the left column), but often you will want to specify one yourself:

In [3]:
s = Series([14, 7, 3, -7, 8], index=["a", "b", "c", "d", "e"])
print(s)

a    14
b     7
c     3
d    -7
e     8
dtype: int64


We can use this index to retrieve individual rows

In [4]:
s["a"]

14

to replace values in the series

In [5]:
s["c"] = -1

or to get a set of rows

In [6]:
s[["a", "c", "d"]]

a    14
c    -1
d    -7
dtype: int64
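
Two related tools may also be useful here. The sketch below builds a fresh `Series` called `labelled` (a name chosen just for this example) and assumes two standard pandas behaviours: the `in` operator tests against the index labels, and `.loc` is the explicit label-based accessor.

```python
from pandas import Series

labelled = Series([14, 7, 3, -7, 8], index=["a", "b", "c", "d", "e"])

# Membership tests look at the index labels, not the values
print("a" in labelled)   # True
print(14 in labelled)    # False

# .loc makes it explicit that we are looking things up by label
print(labelled.loc["b"])                   # 7
print(labelled.loc[["b", "e"]].tolist())   # [7, 8]
```

Plain square brackets work for the examples in this section, but `.loc` can make the intent clearer when the labels are themselves integers.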

### Exercise

- Create a Pandas `Series` with 10 or so elements where the indices are years and the values are numbers.
  - Make sure that the indices are set as integers, not strings.
  - Experiment with retrieving elements from the `Series`.
  - [<small>answer</small>](answer_series_years.ipynb)
- Try making another `Series` with duplicate values in the index. What happens when you access those elements? [<small>answer</small>](answer_series_duplicate_index.ipynb)
- How does a Pandas `Series` differ from a Python `list` or `dict`? [<small>answer</small>](answer_series_list_dict.ipynb)

## Series operations

A `Series` is `list`-like in the sense that it is an ordered sequence of values. It is also `dict`-like since its entries can be accessed via key lookup. One very important way in which it differs from both is that it allows operations to be done over the whole `Series` in one go, a technique often referred to as 'broadcasting'.

A simple example is wanting to double the value of every entry in a set of data. In standard Python, you might have a list like

In [5]:
my_list = [3, 6, 8, 4, 10]

If you wanted to double every entry you might try simply multiplying the list by `2`:

In [6]:
my_list * 2

[3, 6, 8, 4, 10, 3, 6, 8, 4, 10]

but as you can see, that simply duplicated the elements. Instead you would have to use a `for` loop or a list comprehension:

In [7]:
[i * 2 for i in my_list]

[6, 12, 16, 8, 20]

With a pandas `Series`, however, you can perform bulk mathematical operations on the whole series in one go:

In [8]:
my_series = Series(my_list)
print(my_series)

0     3
1     6
2     8
3     4
4    10
dtype: int64


In [11]:
my_series * 2

0     6
1    12
2    16
3     8
4    20
dtype: int64

In [12]:
my_series < 8

0     True
1     True
2    False
3     True
4    False
dtype: bool
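
The same pattern applies to the other arithmetic and comparison operators. As a brief sketch, reusing the values from `my_list` above:

```python
from pandas import Series

my_series = Series([3, 6, 8, 4, 10])

# Each operation is applied to every element and returns a new Series
print((my_series + 1).tolist())       # [4, 7, 9, 5, 11]
print((my_series ** 2).tolist())      # [9, 36, 64, 16, 100]
print((my_series % 2 == 0).tolist())  # [False, True, True, True, True]

# The original Series is left unchanged
print(my_series.tolist())             # [3, 6, 8, 4, 10]
```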

## Querying

As well as bulk modifications, you can perform bulk selections by putting more complex statements in the square brackets. For example, if we want to get back all the entries which are less than zero we can do:

In [13]:
s[s < 0]  # All negative entries

c   -1
d   -7
dtype: int64

We can see how this works by breaking it down into smaller steps. First we'll look at just the original `Series`:

In [14]:
s

a    14
b     7
c    -1
d    -7
e     8
dtype: int64

If we do a broadcast boolean operation on it, we get back a `Series` which contains only `True`s and `False`s:

In [15]:
s < 0

a    False
b    False
c     True
d     True
e    False
dtype: bool

Here you can see that the rows `c` and `d` are `True` while the others are `False`.

A handy feature of `Series` is that as well as passing a single index value into the square brackets, you can also pass a `Series` of booleans. Pandas will then match up the indices of the two `Series` and filter out any rows that are `False`, leaving only `c` and `d`:

In [16]:
s[s < 0]

c   -1
d   -7
dtype: int64
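
Boolean `Series` can also be combined before being used as a filter. The sketch below assumes the element-wise operators `&` (and), `|` (or) and `~` (not), and rebuilds `s` with the values it holds at this point. Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`.

```python
from pandas import Series

s = Series([14, 7, -1, -7, 8], index=["a", "b", "c", "d", "e"])

# Entries that are positive AND less than 10
small_positive = s[(s > 0) & (s < 10)]
print(small_positive.tolist())      # [7, 8]

# Entries that are negative OR greater than 10
extremes = s[(s < 0) | (s > 10)]
print(extremes.tolist())            # [14, -1, -7]

# Entries that are NOT negative
non_negative = s[~(s < 0)]
print(list(non_negative.index))     # ['a', 'b', 'e']
```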

## Multi-Series operations

It is also possible to perform operations between two `Series` objects:

In [17]:
s2 = Series([23, 5, 34, 7, 5])
s3 = Series([7, 6, 5, 4, 3])
s2 - s3

0    16
1    -1
2    29
3     3
4     2
dtype: int64
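
Subtraction is only one possibility. As a short sketch using the same two `Series`, other operators also work row by row:

```python
from pandas import Series

s2 = Series([23, 5, 34, 7, 5])
s3 = Series([7, 6, 5, 4, 3])

# Each result row is computed from the matching rows of s2 and s3
print((s2 + s3).tolist())  # [30, 11, 39, 11, 8]
print((s2 * s3).tolist())  # [161, 30, 170, 28, 15]
print((s2 > s3).tolist())  # [True, False, True, True, True]
```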

### Exercise

- Create two `Series` objects of equal length with no specified index and containing any values you like. Perform some mathematical operations on them and experiment to make sure it works how you think. [<small>answer</small>](answer_series_multi.ipynb)
- What happens when you perform an operation on two series which have different lengths? How does this change when you give the series some indices? [<small>answer</small>](answer_series_multi_unequal.ipynb)