_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="entidades financiadoras"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____

# Series

Series' data can be many different things:

- a Python dict
- an `ndarray`
- a scalar value (like 2020)

A Pandas Series is like a column in a table: it is a one-dimensional array holding data of any type.

So, in essence:
- **Series is a one-dimensional labeled array** capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 
- The **axis labels** are collectively referred to as the index. 



![images/01_table_series.svg](images/01_table_series.svg)


Let us start by importing the necessary libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

## Creating a Series

To create a `Series` you can use the `Series` constructor:

```s = pd.Series(data, index=index)```

where `data` can be many different things.

If `data` is not given, it defaults to an empty Series.

In [None]:
s = pd.Series(dtype='float')
s


### From `ndarray`
- If data is an `ndarray`, index must be the same length as data. 
- If no index is passed, one will be created having values [0, ..., len(data) - 1].
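The two rules above can be checked with a small sketch. The array contents and the names `values`, `default_indexed` and `length_mismatch_rejected` are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# No index passed: pandas creates the default labels 0, 1, ..., len(data) - 1
default_indexed = pd.Series(values)
print(list(default_indexed.index))

# An index whose length differs from the data is rejected with a ValueError
try:
    pd.Series(values, index=["a", "b"])
    length_mismatch_rejected = False
except ValueError as err:
    length_mismatch_rejected = True
    print("ValueError:", err)
```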

In [None]:
v = np.random.normal(size=5)
print(type(v))
v

The index can be passed directly as a parameter

In [None]:
s = pd.Series(v, 
              index=tuple('abcde'))
s

### From dict

Series can be instantiated from dictionaries.

When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order (Python >= 3.6 and Pandas >= 0.23).
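As a minimal sketch of this behaviour, the example below builds a Series from a small made-up dict (`prices` is an illustrative name), first without an index and then with an explicit one that selects and reorders the keys.

```python
import pandas as pd

prices = {"b": 2.0, "a": 1.0, "c": 3.0}  # illustrative data

# Without an index, the labels follow the dict's insertion order
from_dict = pd.Series(prices)
print(list(from_dict.index))

# With an index, the values are looked up by key, in the order of that index
reordered = pd.Series(prices, index=["a", "b"])
print(reordered)
```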

In [None]:
d = {k: v for k, v in zip('abcde', np.random.normal(size=5))}
d

In [None]:
s = pd.Series(d)
s

### From Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [None]:
s = pd.Series(1.0, index=tuple('olivia'))
s

## Accessing data

Let us create a Series to illustrate how to access its data

In [None]:
s = pd.Series(v, index=tuple('abcde'))
s

The index is given by the `index` attribute

In [None]:
s.index

The values, as a NumPy `ndarray`, are given by the `values` attribute

In [None]:
s.values

And, **values can be accessed by label index** (like a dict) or by position index (like a list). 

For example, the dot notation can be used to access the values by label index

In [None]:
s.a

or access using the index as a key

In [None]:
s['a']

The `.loc` attribute can also be used to access the values by label index

In [None]:
s.loc['a']

or by 0-based position using the `.iloc` attribute

In [None]:
s.iloc[0]

It is also possible to do **slicing** (we will see more of it later)

In [None]:
s['b':'d']

But be careful, indexes are not necessarily unique.

In [None]:
s_not_unique_idx = pd.Series(1.0, index=tuple('olivia'))
s_not_unique_idx['i']

- When the index is **unique**, pandas uses a hash table to map keys to values: O(1).
- When the index is **non-unique and sorted**, pandas uses binary search: O(log N).
- When the index is **non-unique and in arbitrary order**, pandas needs to check all the keys in the index: O(N).
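Which of these cases applies to a given Series can be inspected through the `is_unique` and `is_monotonic_increasing` properties of its index. The following is a small sketch with made-up labels; it also shows that a repeated label returns all of its matches.

```python
import pandas as pd

unique_labels = pd.Series(1.0, index=list("cab"))
repeated_labels = pd.Series(1.0, index=list("olivia"))

for name, series in [("unique", unique_labels), ("repeated", repeated_labels)]:
    idx = series.index
    print(name, "| unique:", idx.is_unique, "| sorted:", idx.is_monotonic_increasing)

# A repeated label returns every matching entry as a Series, not a scalar
matches = repeated_labels["i"]
print(type(matches).__name__, len(matches))

# After sorting, the index is monotonic even though it is still non-unique
sorted_repeated = repeated_labels.sort_index()
print(sorted_repeated.index.is_monotonic_increasing)
```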

Let us then check the access time for these cases:

In [None]:
idx_unique = 'defghabcijkqrstuvwlmnopxyz'

print('index length', len(idx_unique), ' (unique: ', len(set(idx_unique)), ')')

# the values are not important, so we can use the same value for all
s = pd.Series(1.0, index=tuple(idx_unique))

print("Unique & unsorted")
%timeit s['z']

print("Unique & sorted")
s.sort_index(inplace=True)
%timeit s['z']

In [None]:
idx_not_unique = "oliviaoliviaoliviaoliviaol"
print('index length', len(idx_not_unique), ' (unique: ', len(set(idx_not_unique)), ')')

s = pd.Series(1.0, index=tuple(idx_not_unique))

print("Not unique & unsorted")
%timeit s['a']

print("Not unique & sorted")
s.sort_index(inplace=True)
%timeit s['a']

In the above examples, note the use of the `inplace=True` parameter to modify the original Series. By default, the `inplace` parameter is `False` and a new Series is returned.
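The difference between the two modes can be seen in a short sketch (the Series and variable names here are illustrative only):

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=list("cab"))

# Default (inplace=False): a sorted copy is returned, the original is unchanged
sorted_copy = original.sort_index()
index_before = list(original.index)
print(index_before, list(sorted_copy.index))

# inplace=True: the original Series itself is reordered
original.sort_index(inplace=True)
index_after = list(original.index)
print(index_after)
```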

## NaN
`NaN` (*not a number*) is the standard missing data marker used in pandas. `NaN` is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. In the example below, the label `z` has no entry in the dict, so its value is `NaN`.



In [None]:
d = {'a' : 1, 'b' : 2, 'c': 3}

pd.Series(d, index=tuple('abz'))
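Missing values can then be detected, dropped or replaced. The sketch below rebuilds the same Series under the illustrative name `with_missing` and uses `isna`, `dropna` and `fillna`; the replacement value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

with_missing = pd.Series({"a": 1, "b": 2, "c": 3}, index=list("abz"))

# NaN never compares equal to anything, including itself, so use isna()
print(np.nan == np.nan)
print(with_missing.isna())

# NaN is a float, so the integer values were upcast to float64
print(with_missing.dtype)

# Drop the missing entries, or replace them with a chosen value
without_missing = with_missing.dropna()
zero_filled = with_missing.fillna(0)
print(without_missing)
print(zero_filled)
```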

## Get series values types
The type of the values can be obtained via `dtype`.
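As a rough illustration, pandas infers the `dtype` from the data, and `astype` returns a converted copy. The three small Series below are made up for this sketch.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1, 2.5, 3])
mixed = pd.Series([1, "two", 3.0])

# Integers stay integers, one float promotes everything to float,
# and mixed Python objects fall back to the generic object dtype
print(ints.dtype, floats.dtype, mixed.dtype)

# astype returns a new Series with the requested dtype
as_float = ints.astype("float64")
print(as_float.dtype)
```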


In [None]:
s

In [None]:
# dtype of the values ('float64' here, since the Series holds floats)
s.dtype

And the values themselves

In [None]:
# array of values as a numpy array 
print(type(s.values))

In [None]:
# as a pandas ExtensionArray
print(type(s.array))
s.array

In [None]:
# also as a numpy array
print(type(s.to_numpy()))
s.to_numpy()

## Slicing
A Series acts very similarly to an ndarray.
However, operations such as slicing will also slice the index.
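A small sketch of the difference, using made-up values: slicing an ndarray returns only values, while slicing a Series keeps the matching labels. Note also that positional slices exclude the end position, whereas label slices with `loc` include both endpoints.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40, 50])
ser = pd.Series(arr, index=list("abcde"))

# Positional slice: the end position is excluded, as with an ndarray
print(arr[1:3])
by_position = ser.iloc[1:3]   # labels 'b' and 'c' travel with the values
print(by_position)

# Label slice: both endpoints are included
by_label = ser.loc["b":"d"]
print(by_label)
```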

In [None]:
s = pd.Series(np.random.normal(size=5), index=tuple('abcde'))
s

Build a boolean mask that marks the entries satisfying some filter

In [None]:
s>0

Given the filter, you can get the values

In [None]:
s[s>0]

Another example

In [None]:
s[s>s.median()]

To get values given their 0-based positions, use `iloc`

In [None]:
s.iloc[[0, 1, 2]]

To get values given their labels, use `loc` or key access with `[]`

In [None]:
s[['a', 'c', 'e']]

## Modifying data

To change a Series' values, you can assign through the same indexing operations

In [None]:
s["a"] = 10
s

In [None]:
s.loc["b"] = 20
s

In [None]:
s.iloc[2:] = 30
s

In [None]:
s.iloc[-1] = 40
s

## Exercise
Before running them, try to guess the output of the following lines

In [None]:
s1 = pd.Series(np.random.randn(10), index=list(range(0, 50, 5)))
s1

In [None]:
s1[5]

In [None]:
s1.loc[5]

In [None]:
s1.loc[:5]

In [None]:
# !
s1[:5]

In [None]:
s1.iloc[:5]

In [None]:
s1.iloc[5]

## Operations
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy functions expecting an ndarray.
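For instance, element-wise NumPy functions applied to a Series return a Series with the same index. This is a minimal sketch with made-up values (`s_demo` is an illustrative name).

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 4.0, 9.0], index=list("abc"))

# Element-wise NumPy functions return a Series that keeps the labels
roots = np.sqrt(s_demo)
print(roots)

# Reductions collapse the Series to a single value
largest = np.max(s_demo)
print(largest)
```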

In [None]:
r = pd.Series([1, 2, 3, 4, 5], index=tuple('abcde'))
s = pd.Series([1, 2, 3, 4, 5], index=tuple('edcba'))

In [None]:
r + s

In [None]:
2 * r

In [None]:
r ** 2

In [None]:
2 ** r

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (`NaN`).
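The alignment can be sketched with two small, partially overlapping Series (`left` and `right` are illustrative names). When `NaN` is not wanted, one option is `Series.add` with `fill_value`, which treats a label missing on one side as the given value.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=list("abc"))
right = pd.Series([10, 20, 30], index=list("bcd"))

# Labels are matched by name; 'a' and 'd' appear in only one operand
total = left + right
print(total)

# With fill_value=0, a label missing on one side counts as 0
filled = left.add(right, fill_value=0)
print(filled)
```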

In [None]:
s[1:] + s[:-1]

Similarly, using `iloc`:

In [None]:
s.iloc[1:] + s.iloc[:-1]

## Describe & Visualize data

In [None]:
s = pd.Series(np.random.normal(size=10000))
s.head(10)

In [None]:
s.info()

In [None]:
s.describe()

You can draw several kinds of plots, such as line, bar, box, etc. (see https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)

In [None]:
s.plot(figsize=(25, 10))

In [None]:
s.plot(kind='box')

In [None]:
s.plot(kind='hist', bins=50)

In [None]:
s.plot(kind='hist', bins=50, cumulative=True)

In [None]:
s.plot(kind='kde')