# [CptS 215 Data Analytics Systems and Algorithms](https://github.com/gsprint23/cpts215)
[Washington State University](https://wsu.edu)

[Gina Sprint](http://eecs.wsu.edu/~gsprint/)
# Pandas `Series`

Learner objectives for this lesson:
* Learn about the Pandas library
* Work with Pandas `Series` objects

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)
* Python for Data Analysis by Wes McKinney

## Pandas Overview
From the [Pandas website](http://pandas.pydata.org/):
>pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

The data structures in Pandas have axes that are *labeled*. This is quite useful for implementing software for data analytics. Historically, the Pandas library has provided three main data structures:
1. `Series`: a one-dimensional labeled array.
1. `DataFrame`: a two-dimensional labeled data structure.
1. `Panel`: a three-dimensional labeled data structure (deprecated in Pandas 0.20 and removed in 0.25).

In this class, we will mostly work with `Series` and `DataFrames`.

Typically, we will import the Pandas library as `pd`:

In [1]:
import pandas as pd

## `Series`
`Series` is a one-dimensional labeled array. The axis labels are collectively referred to as the *index*. Each index value *maps* to a data item in the `Series`. You can think of a `Series` as an ordered dictionary augmented with data analysis functionality.

Suppose we have a dictionary of city (key) and population (value) pairs that we want to convert into a `Series`:

In [2]:
my_dict = {"Seattle": 652405, "Spokane": 210721, "Bellevue": 133992, "Leavenworth": 1992}
ser = pd.Series(my_dict)
print(ser)

Bellevue       133992
Leavenworth      1992
Seattle        652405
Spokane        210721
dtype: int64


As you can see in the output, the left column is the index (the keys) and the right column is the data (the values). The data type of the data (population) is `int64`.
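To make the dictionary analogy concrete, here is a minimal sketch that reuses the city/population data from above. It reads the index and the data separately and then uses a few dictionary-style operations. Note that the order of the labels may depend on the Pandas version: older releases sorted dictionary keys (as in the output above), while newer releases keep the insertion order.

```python
import pandas as pd

my_dict = {"Seattle": 652405, "Spokane": 210721, "Bellevue": 133992, "Leavenworth": 1992}
ser = pd.Series(my_dict)

# The index (labels) and the data are stored separately
labels = list(ser.index)
values = ser.to_numpy()  # the data as a NumPy array
print(labels)
print(values)

# Dictionary-like behavior: lookup by key, membership test, default value
print(ser["Spokane"])         # 210721
print("Seattle" in ser)       # True (membership is tested against the index)
print("Pullman" in ser)       # False
print(ser.get("Pullman", 0))  # 0 (default returned for a missing label)
```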

We can also pass in a list or a NumPy `ndarray`.

In [3]:
data = [1, 2, 3, 4]
ser = pd.Series(data)
print(ser)

import numpy as np
data = np.random.randn(5)
ser = pd.Series(data)
print(ser)

0    1
1    2
2    3
3    4
dtype: int64
0    0.497088
1    0.522258
2    0.634724
3   -0.006576
4   -0.123352
dtype: float64


As you can see in the output, by default the index for a `Series` is numeric: integers starting at 0 and increasing by 1. We can explicitly set the index at instantiation time with the keyword argument `index`:

In [4]:
import string
alpha = list(string.ascii_lowercase[:5])
data = np.random.randn(5)
ser = pd.Series(data, index=alpha)
print(ser)

a    1.176157
b   -0.398128
c    1.163027
d   -0.143429
e   -1.024642
dtype: float64


### `name` Attribute
Both the data and index of a `Series` can be named. We will see this is especially useful for integrating `Series` into `DataFrames`.

In [5]:
my_dict = {"Seattle": 652405, "Spokane": 210721, "Bellevue": 133992, "Leavenworth": 1992}
ser = pd.Series(my_dict)
print(ser)
ser.name = "Population"
ser.index.name = "City"
print("After naming the data and index")
print(ser)

Bellevue       133992
Leavenworth      1992
Seattle        652405
Spokane        210721
dtype: int64
After naming the data and index
City
Bellevue       133992
Leavenworth      1992
Seattle        652405
Spokane        210721
Name: Population, dtype: int64
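As a small preview of why the names matter, the sketch below converts a named `Series` into a `DataFrame` with `to_frame()`. The `Series` name becomes the column label and the index name is carried over. The two-city `Series` is a shortened version of the example above.

```python
import pandas as pd

ser = pd.Series({"Seattle": 652405, "Spokane": 210721})
ser.name = "Population"
ser.index.name = "City"

# The Series name becomes the column label; the index name is preserved
df = ser.to_frame()
print(df)
print(df.columns.tolist())  # ['Population']
print(df.index.name)        # City
```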


### Descriptive and Summary Statistics
Like NumPy `ndarray` objects, Pandas objects (e.g. `Series` and `DataFrames`) include mathematical and statistical methods. Examples of such methods include:
* `count()`: Count the number of non-NA values
* `min()`, `max()`: Compute minimum and maximum values
* `argmin()`, `argmax()`: Compute index locations (integers) at which the minimum or maximum value is obtained
* `idxmin()`, `idxmax()`: Compute index values (labels) at which the minimum or maximum value is obtained
* `quantile()`: Compute the sample quantile for a given fraction between 0 and 1
* `sum()`: Sum of values
* `cumsum()`: Cumulative sum of values
* `mean()`: Mean of values
* `median()`: Arithmetic median (50% quantile) of values
* Many others!
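The methods listed above can be tried on the city/population data. This is a minimal sketch. The `Series` is built from a list with an explicit index so that the positions reported by `argmin()` and `argmax()` do not depend on how a particular Pandas version orders dictionary keys.

```python
import pandas as pd

ser = pd.Series([652405, 210721, 133992, 1992],
                index=["Seattle", "Spokane", "Bellevue", "Leavenworth"])

print(ser.count())                 # 4
print(ser.min(), ser.max())        # 1992 652405
print(ser.argmin(), ser.argmax())  # 3 0 (integer positions)
print(ser.idxmin(), ser.idxmax())  # Leavenworth Seattle (labels)
print(ser.quantile(0.25))          # 25% sample quantile
print(ser.sum())                   # 999110
print(ser.cumsum())                # running total, one value per label
print(ser.mean())                  # 249777.5
print(ser.median())                # 172356.5
```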

### Similarities between `Series` and `ndarray`
`Series` is `ndarray`-like, which means it resembles `ndarray` in many ways. For example:
* `Series` has attributes and methods similar to those of `ndarray`
* You can pass a `Series` instead of an `ndarray` to most NumPy functions, including ufuncs
    * Vectorization
    * Overloaded operators
* You can index and slice `Series` like `ndarray`

In [7]:
my_dict = {"Seattle": 652405, "Spokane": 210721, "Bellevue": 133992, "Leavenworth": 1992}
ser = pd.Series(my_dict)
print(ser)
# attributes
print("ser.shape:%s" %(str(ser.shape)))
print("ser.dtype:%s" %(str(ser.dtype)))
# methods
print("ser.mean():%s" %(str(ser.mean())))
# numpy ufuncs
print("np.mean(ser):%s" %(str(np.mean(ser))))
# vectorization
print("np.sqrt(ser):\n%s" %(str(np.sqrt(ser))))
print("ser + ser:\n%s" %(str(ser + ser)))
print("ser * 10:\n%s" %(str(ser * 10)))
# indexing
print("Indexing ser[0]:%s" %(str(ser[0])))
print("Indexing ser[[0, 2]]:\n%s" %(str(ser[[0, 2]])))
print("Boolean indexing ser[[ser > ser.median()]]:\n%s" %(str(ser[ser > ser.median()])))
print("Slicing ser[0:2]:\n%s" %(str(ser[0:2])))
print("Slicing ser[2:]:\n%s" %(str(ser[2:])))

Bellevue       133992
Leavenworth      1992
Seattle        652405
Spokane        210721
dtype: int64
ser.shape:(4,)
ser.dtype:int64
ser.mean():249777.5
np.mean(ser):249777.5
np.sqrt(ser):
Bellevue       366.049177
Leavenworth     44.631827
Seattle        807.715915
Spokane        459.043571
dtype: float64
ser + ser:
Bellevue        267984
Leavenworth       3984
Seattle        1304810
Spokane         421442
dtype: int64
ser * 10:
Bellevue       1339920
Leavenworth      19920
Seattle        6524050
Spokane        2107210
dtype: int64
Indexing ser[0]:133992
Indexing ser[[0, 2]]:
Bellevue    133992
Seattle     652405
dtype: int64
Boolean indexing ser[[ser > ser.median()]]:
Seattle    652405
Spokane    210721
dtype: int64
Slicing ser[0:2]:
Bellevue       133992
Leavenworth      1992
dtype: int64
Slicing ser[2:]:
Seattle    652405
Spokane    210721
dtype: int64
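Newer Pandas releases emit a deprecation warning for positional integer keys such as `ser[0]` when the index holds labels. The sketch below shows the explicit alternatives: `.iloc` for position-based access and `.loc` for label-based access. The `Series` is built from a list so the order matches the output above.

```python
import pandas as pd

ser = pd.Series([133992, 1992, 652405, 210721],
                index=["Bellevue", "Leavenworth", "Seattle", "Spokane"])

# Position-based access (like ndarray indexing)
print(ser.iloc[0])       # 133992
print(ser.iloc[[0, 2]])  # Bellevue and Seattle
print(ser.iloc[0:2])     # end position is excluded: Bellevue, Leavenworth

# Label-based access
print(ser.loc["Seattle"])             # 652405
print(ser.loc["Bellevue":"Seattle"])  # label slices include both endpoints
```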


### Differences between `Series` and `ndarray`
From the [Pandas website](http://pandas.pydata.org/):

>A key difference between `Series` and `ndarray` is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels. The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing NaN. Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
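
The alignment behavior described in the quote can be seen with two small `Series` whose labels only partially overlap. This is a minimal sketch with made-up values. The labels `a` through `d` are purely illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position.
# Labels present in only one Series ("a" and "d") produce NaN.
total = s1 + s2
print(total)

# To treat a missing label as 0 instead of NaN, use add() with fill_value
filled = s1.add(s2, fill_value=0)
print(filled)
```

Because `NaN` is a floating point value, the result of `s1 + s2` has dtype `float64` even though both inputs hold integers.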