# Pandas Lesson 1: Series

This tutorial introduces the fundamental building block of `pandas`: `Series`. By the end of this section, you will learn how to create different types of Series, subset them, modify them, and summarize them.

## 1. What is a Series?

In the simplest terms, a `Series` is an ordered collection of values, generally all of the same type. For example, you can have a Series that contains the ages of everyone in your class (a numeric Series), or a Series of all the names of people in your family (a string Series).

This may sound familiar: isn't that how we described `numpy` vectors (i.e. one-dimensional numpy arrays)? Yes! In fact, Series are basically one-dimensional `numpy` arrays with lots of extra features added on top of them. As we'll see, most everything you could do with a `numpy` array you can do with a Series; Series can just do *more*. 
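As a small sketch of that relationship (the `heights` values here are made up for illustration), a Series can be built from a `numpy` array, passed to a `numpy` function, and converted back to an array:

```python
import numpy as np
import pandas as pd

# Hypothetical heights in centimeters, stored first as a plain numpy vector
heights_array = np.array([150.0, 162.5, 171.0])
heights = pd.Series(heights_array)

# numpy functions accept a Series and hand back a Series
roots = np.sqrt(heights)

# .to_numpy() returns the values as a regular numpy array again
back_to_array = heights.to_numpy()

print(type(roots).__name__)          # Series
print(type(back_to_array).__name__)  # ndarray
```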

Series are central to `pandas` because `pandas` was designed for statistics, and Series are a perfect way to collect lots of different observations of a variable.

There are lots of ways to create Series, but the easiest is to just pass a list or an array to the `pd.Series` constructor. 

To illustrate, let me tell you about a week at the zoo I wish I owned. Here's what attendance looked like at my zoo last week:

| Day of Week | Attendees  |
|-------------|------------|
| Monday      | 132 people |
| Tuesday     | 94 people  |
| Wednesday   | 112 people |
| Thursday    | 84 people  |
| Friday      | 254 people |
| Saturday    | 322 people |
| Sunday      | 472 people |

Let's make a Series for this attendance pattern:

In [1]:
import pandas as pd # We have to import pandas to use Series!

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance

0    132
1     94
2    112
3     84
4    254
5    322
6    472
dtype: int64

## Indices

One of the fundamental differences between `numpy` arrays and Series is that all Series are associated with an `index`. An index is a set of labels, one for each observation in a Series. If you don't specify an `index` when you create a Series, `pandas` will create a default index that labels each row with its initial row number, but you can specify an index if you want.

In this case, for example, we know that these entries are associated with different days of the week, so let's specify an index for our `attendance` Series:

In [2]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
                              'Friday', 'Saturday', 'Sunday'])
attendance

Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

Now, as we can see, the rows are labeled with days of the week on the left side, rather than with initial row numbers.

Note that you can always access a Series' index with the `.index` property: 

In [3]:
attendance.index

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')

An important property of index labels is that they stay with each row, even if you sort your data. So if I sort my Series by attendance, not only will rows re-order, but so will the index labels:

In [4]:
attendance = attendance.sort_values()
attendance

Thursday      84
Tuesday       94
Wednesday    112
Monday       132
Friday       254
Saturday     322
Sunday       472
dtype: int64

**Note:** This seems intuitive with days-of-the-week as our index labels, but it can be confusing when your index starts out as row numbers. For example, if we had not changed our index to be days of the week, the default index labels would look like they were just row numbers. But if we then *sort* the Series, the labels will shuffle along with their rows, and they will no longer correspond to row numbers:

In [5]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance

0    132
1     94
2    112
3     84
4    254
5    322
6    472
dtype: int64

In [6]:
attendance = attendance.sort_values()
attendance

3     84
1     94
2    112
0    132
4    254
5    322
6    472
dtype: int64
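One way to see the difference between labels and positions after sorting is to compare label-based lookup (`.loc`) with position-based lookup (`.iloc`). This is only a minimal sketch of those two accessors, along with `reset_index(drop=True)` for the case where you would rather have fresh row numbers:

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472]).sort_values()

# .loc looks up by index LABEL: the row labeled 0 is still the original first entry
by_label = attendance.loc[0]

# .iloc looks up by POSITION: the first row after sorting is the smallest value
by_position = attendance.iloc[0]

print(by_label)     # 132
print(by_position)  # 84

# If the old labels are no longer useful, drop them and renumber from 0
renumbered = attendance.reset_index(drop=True)
```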

## Types of Series

Before we dive too far into Series manipulations, it's important to talk about datatypes. Every Series, as we will see, has a "dtype" (short for datatype). The `dtype` of a Series is important to understand because a Series' `dtype` determines what manipulations you can apply to that series. 

There are, broadly, two types of Series: 

- Numeric: these hold numbers that `pandas` understands are numbers. Specific numeric datatypes include `int64` and `int32` (integers), and `float64` and `float32` (floating point numbers).
- Object: these are Series that can hold any Python object, like strings, numbers, sets, you name it. They have dtype `O` for "object". They are flexible, but also much slower and harder to work with than numeric Series.

Numeric Series are by far the easiest to work with, and are generally either *integers* (`int64`, `int32`, etc.) or *floating point numbers* (`float64`, `float32`). We'll talk more about the differences between these data types later, but for the moment it's enough to know that *integer* Series (datatypes that start with `int`) can *only* hold... well, integers (whole numbers), while *floating point* Series (datatypes that start with `float`) can hold integers, numbers with decimal points, and even missing values.
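To illustrate the point about missing values, here is a small sketch in which `None` stands in for a missing observation. With the default constructor, `pandas` stores such a Series as floating point rather than integer:

```python
import pandas as pd

whole = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])  # None stands in for a missing value

print(whole.dtype)             # int64
print(with_gap.dtype)          # float64
print(with_gap.isna().sum())   # 1 missing value
```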

The numbers at the end of these types (`64`, `32`, etc.) have to do with how many actual bits of data are allocated to each number, something we'll discuss later in the course. For the moment, the differences between them don't matter, and in general you'll likely always see (and should use) the `64` suffix.  

You can check the `dtype` of a Series by typing `.dtype`. For example, here are some different kinds of Series:

In [27]:
s = pd.Series([1, 2, 3])
s.dtype

dtype('int64')

In [28]:
s = pd.Series([1, 2, 3.14])
s.dtype

dtype('float64')

In [29]:
s = pd.Series([1, 2, "a string"])
s.dtype

dtype('O')

As you can see, integer (`int64`) Series can *only* hold integers. If we add one number with a decimal component, the whole thing becomes a `float64`. Similarly, floating point Series can only hold numbers. If we add a single string, the whole thing becomes an Object (`O`) type.

### Converting datatypes

If you want to change the datatype of a Series, you can do so with the `.astype()` method... provided a conversion is possible! For example, you can always convert integer Series to floating point Series because a whole number can be represented as a floating point number (just trust me on this for now... we'll discuss why later!).

In [30]:
s = pd.Series([1, 2, 3])
s = s.astype('float64')
s

0    1.0
1    2.0
2    3.0
dtype: float64

But be careful: since integers can't ever hold decimals, if you try to convert a floating point Series to an integer Series, it will just drop the decimal part of numbers with decimals!

In [31]:
s = pd.Series([1, 2, 3.14])
s = s.astype('int64')
s

0    1
1    2
2    3
dtype: int64

Note that `pandas` is just doing the same thing regular Python would do:

In [32]:
int(3.14)

3

But if you try to convert an "object" Series to numeric and there are values that can't be converted, `pandas` will throw an error:

In [33]:
s = pd.Series([1, 2, "a string"])
s.astype('float64')

ValueError: could not convert string to float: 'a string'
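If you would rather keep going than stop on an error, one option (not covered elsewhere in this lesson, so treat this as a sketch) is `pd.to_numeric` with `errors="coerce"`, which replaces anything it cannot convert with a missing value:

```python
import pandas as pd

s = pd.Series([1, 2, "a string"])

# errors="coerce" turns values that cannot be parsed as numbers into NaN
cleaned = pd.to_numeric(s, errors="coerce")

print(cleaned.dtype)            # float64
print(cleaned.isna().tolist())  # [False, False, True]
```

Whether silently turning bad values into missing values is appropriate depends on your data, so it is worth checking how many entries were coerced.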

## 3. Series Arithmetic

One of the nice things about Series is that, like `numpy` arrays, we can easily do things like multiply *all* the values by another number. For example, suppose tickets to my zoo cost $15 per person. What is the total money generated by ticket sales each day? Let's find out!

In [34]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
                              'Friday', 'Saturday', 'Sunday'])
attendance

Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

In [35]:
revenue = attendance * 15
revenue

Monday       1980
Tuesday      1410
Wednesday    1680
Thursday     1260
Friday       3810
Saturday     4830
Sunday       7080
dtype: int64

Now what if we want to know the total amount raised in a week, instead of just the amount on each day? We can use one of `pandas`' many helper methods (in this case `sum`), which adds up all the values of a Series:

In [36]:
revenue.sum()

22050

Cool! 

This is an example of one of the three forms of Series arithmetic:

1. A Series with more than one element and a single value (as in `attendance * 15`).
2. A Series modified by a function. 
3. Two Series with the same number of elements. **When working with two Series, elements are matched based on index values, not row numbers**.

But note that the types of things you can do with a Series depend on the Series' `dtype`. Math functions, for example, can only be applied to numeric datatypes!
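The three forms listed above can be sketched as follows. The `staff` numbers are hypothetical and exist only to show that two Series are matched on index labels even when their rows are in a different order:

```python
import numpy as np
import pandas as pd

attendance = pd.Series([132, 94, 112],
                       index=['Monday', 'Tuesday', 'Wednesday'])

# 1. A Series and a single value: every element is multiplied by 15
revenue = attendance * 15

# 2. A Series modified by a function: the function is applied to each element
log_attendance = np.log(attendance)

# 3. Two Series: values are paired by index label, not by row order
#    (hypothetical staffing data, deliberately listed in a different order)
staff = pd.Series([4, 6, 5],
                  index=['Wednesday', 'Monday', 'Tuesday'])
visitors_per_staff = attendance / staff

print(visitors_per_staff['Monday'])     # 132 / 6 = 22.0
print(visitors_per_staff['Wednesday'])  # 112 / 4 = 28.0
```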

### Summarizing with Functions 

We often want to get summary statistics from a Series --- that is, learn something general about it by looking beyond its constituent elements. If we have a Series in which each element represents a person's height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for numeric Series (some also work for object types):

```python

my_numbers = pd.Series([1, 2, 3, 4])

my_numbers.dtype #check the dtype
len(my_numbers) #number of elements 
my_numbers.max() #maximum value
my_numbers.min() #minimum value
my_numbers.sum() #sum of all values in the Series
my_numbers.mean() #mean
my_numbers.median() #median
my_numbers.var() #variance
my_numbers.std() #standard deviation
my_numbers.quantile() #return specified quantile, 0.5 if none specified
my_numbers.describe() #function that contains many summary stats from above
my_numbers.value_counts() # Tabulate all the values. Add the argument `normalize=True` to get the share of observations with each value.
```

Of those, two of the most powerful are `.describe()` (for numeric Series that take on lots of values):

In [37]:
my_numbers = pd.Series(range(100))
my_numbers.describe()

count    100.000000
mean      49.500000
std       29.011492
min        0.000000
25%       24.750000
50%       49.500000
75%       74.250000
max       99.000000
dtype: float64

and `.value_counts()` for numeric series with only a few unique values:

In [38]:
my_numbers = pd.Series([1, 2, 2, 2, 2, 1, 1, -1, -1])
my_numbers.value_counts()

 2    4
 1    3
-1    2
dtype: int64

Note that `.value_counts()` can be combined with the `normalize=True` argument to get the share of observations that have each unique value, rather than the count:

In [39]:
my_numbers.value_counts(normalize=True)

 2    0.444444
 1    0.333333
-1    0.222222
dtype: float64