# Data Science and Visualization (RUC F2024)

## Lecture 2: Exploratory Data Analysis (EDA)

# Series in pandas

A *Series* is a one-dimensional array-like object containing a sequence of **values** (of
types similar to NumPy types) and an associated array of data labels, called its **index**.
The simplest Series is formed from only an array of data:

## Construction and content

In [2]:
import pandas as pd

series_1 = pd.Series([4, 7, -5, 3])
series_1

0    4
1    7
2   -5
3    3
dtype: int64

The output above shows the index on the left and the values on the right. Since we did not specify an index for the data, a default index is created, consisting of the integers 0 through N-1, where N is the length of the Series object.

We can get the array representation and index object of the Series via its values and index attributes, respectively:

In [3]:
series_1.values

array([ 4,  7, -5,  3], dtype=int64)

In [4]:
series_1.index

RangeIndex(start=0, stop=4, step=1)

We can get a single element in two ways: by indexing the Series itself, or by indexing its underlying array of values.

In [5]:
series_1[0]

4

In [6]:
series_1.values[0]

4
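The two access styles above differ in what they look up: indexing the Series goes through the index labels, while indexing the array of values goes by position. A minimal sketch that makes the distinction explicit with `loc` (label) and `iloc` (position); the variable names are illustrative only:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# Label-based lookup: with the default RangeIndex, label 0 is the first row.
by_label = s.loc[0]

# Position-based lookup: always the first element, whatever the labels are.
by_position = s.iloc[0]

# Going through the underlying NumPy array is also positional.
by_array = s.to_numpy()[0]

print(by_label, by_position, by_array)  # 4 4 4
```

With a default index the label and the position coincide, so all three agree. They can differ once a custom index is assigned.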

Pay attention to the dtype when one element of a Series has a different type from the others. Below, the second element is the string '7' rather than an integer, so the whole Series gets the generic dtype object.

In [7]:
series_2 = pd.Series([4, '7', -5, 3])

In [8]:
series_2

0     4
1     7
2    -5
3     3
dtype: object

In [9]:
series_2.values

array([4, '7', -5, 3], dtype=object)

In [10]:
series_2.index

RangeIndex(start=0, stop=4, step=1)
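An object Series stores each element as an ordinary Python object, so numeric operations on it are slower and can fail on the string entries. As a hedged sketch, assuming the strings are meant to be numbers, `pd.to_numeric` can convert such a Series back to a numeric dtype (the names `mixed` and `numeric` are illustrative):

```python
import pandas as pd

mixed = pd.Series([4, '7', -5, 3])
print(mixed.dtype)  # object

# Each element keeps its original Python type inside an object Series.
element_types = [type(v).__name__ for v in mixed]
print(element_types)  # ['int', 'str', 'int', 'int']

# pd.to_numeric parses the string '7' and returns a numeric Series.
numeric = pd.to_numeric(mixed)
print(numeric.tolist())  # [4, 7, -5, 3]
```

If a string cannot be parsed as a number, `pd.to_numeric` raises an error by default.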

## Comparing Series and List

* **Series** is defined in pandas
* **List** is defined in Python

In [10]:
list_1 = [4, 7, -5, 3]
list_1

[4, 7, -5, 3]

**NB**: A *list* object in Python has no attribute 'values':

In [11]:
list_1.values

AttributeError: 'list' object has no attribute 'values'

In [12]:
list_1.index

<function list.index(value, start=0, stop=9223372036854775807, /)>

If a list holds values of different types, each element simply keeps its own type; unlike a Series, a list has no single dtype for all its values.

In [13]:
list_2 = [4, '7', -5, 3]
list_2

[4, '7', -5, 3]

In [14]:
# Iterate over the elements directly and print the type of each one
for item in list_2:
    print(type(item))

<class 'int'>
<class 'str'>
<class 'int'>
<class 'int'>
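The comparison above can be taken one step further. Note that `index` on a list is a method for finding the position of a value, not a set of labels, and that arithmetic operators behave differently on the two types. A small sketch with illustrative variable names:

```python
import pandas as pd

values = [4, 7, -5, 3]

# list.index is a method: it returns the position of the first match.
position = values.index(-5)
print(position)  # 2

# For a list, * means repetition: the list is concatenated with itself.
doubled_list = values * 2
print(doubled_list)  # [4, 7, -5, 3, 4, 7, -5, 3]

# For a Series, * is applied element by element.
doubled_series = pd.Series(values) * 2
print(doubled_series.tolist())  # [8, 14, -10, 6]
```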


## More on Series' index

Now let's create a Series with a custom index.

In [15]:
series_3 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [16]:
series_3

d    4
b    7
a   -5
c    3
dtype: int64

In [17]:
series_3.index

Index(['d', 'b', 'a', 'c'], dtype='object')

We can use labels in the index when selecting single values or a set of values:

In [18]:
series_3['d']

4

In [19]:
series_3.values[0]

4

To select more than one element from a Series, pass the labels as a list or array. The result is a new Series in the order of the requested labels.

In [20]:
series_3[['a', 'b', 'c']]

a   -5
b    7
c    3
dtype: int64

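We can also select by integer position. Because the index now holds string labels, use iloc for this; in recent pandas versions, plain square brackets with integers on a labelled index are deprecated.

In [21]:
series_3.iloc[[2, 1, 3]]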

a   -5
b    7
c    3
dtype: int64
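Label-based and position-based selection can be written unambiguously with `loc` and `iloc`. The sketch below rebuilds the same data under an illustrative name `s`, and also shows one difference that is easy to miss: label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# The same three elements, selected by label and by position.
by_label = s.loc[['a', 'b', 'c']]
by_position = s.iloc[[2, 1, 3]]
print(by_label.equals(by_position))  # True

# Label slices include both end points.
label_slice = s.loc['b':'a']
print(label_slice.tolist())  # [7, -5]

# Position slices exclude the stop position, like Python lists.
position_slice = s.iloc[1:2]
print(position_slice.tolist())  # [7]
```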

A Series’s index can be altered *in-place* by assignment:

In [22]:
series_3.index = ['A', 'BB', 'CCC', 'DDDD']
series_3

A       4
BB      7
CCC    -5
DDDD    3
dtype: int64
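Two details of index assignment are worth checking by hand: the new index must have exactly as many labels as the Series has values, and `rename` offers a way to relabel without modifying the original. A minimal sketch, using an illustrative Series `s`:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# rename returns a relabelled copy; labels not in the mapping are kept.
renamed = s.rename({'d': 'A', 'b': 'BB'})
print(renamed.index.tolist())  # ['A', 'BB', 'a', 'c']

# Assigning an index with the wrong number of labels is rejected.
raised = False
try:
    s.index = ['A', 'BB', 'CCC']  # 3 labels for 4 values
except ValueError as err:
    raised = True
    print('ValueError:', err)
```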

## Operations on Series

Operations on a Series, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the index-value link:

In [23]:
series_3[series_3 > 3]

A     4
BB    7
dtype: int64

In [24]:
series_3*2

A        8
BB      14
CCC    -10
DDDD     6
dtype: int64

In [25]:
series_3/2

A       2.0
BB      3.5
CCC    -2.5
DDDD    1.5
dtype: float64

In [26]:
import numpy as np

# exp(x) = e^x, where e is Euler's number (approximately 2.71828)
np.exp(series_3)

A         54.598150
BB      1096.633158
CCC        0.006738
DDDD      20.085537
dtype: float64
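The filtering step works because a comparison on a Series is itself a Series of booleans carrying the same index. A short sketch, with an illustrative Series `s`, that inspects the mask and combines two conditions:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['A', 'BB', 'CCC', 'DDDD'])

# The comparison yields a boolean Series with the same index as s.
mask = s > 3
print(mask.tolist())  # [True, True, False, False]

# Conditions are combined with & (and) or | (or); each one needs parentheses.
selected = s[(s > 0) & (s < 5)]
print(selected.to_dict())  # {'A': 4, 'DDDD': 3}
```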

## Series as dict (for self-study)

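Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a Python *dict*. For example, the in operator tests whether a label is in the index. The check below returns False because the index of series_3 was reassigned earlier to 'A', 'BB', 'CCC', 'DDDD', so the label 'c' no longer exists: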

In [27]:
'c' in series_3

False
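As with a dict, `in` looks at the keys (the index labels) and never at the values. The following sketch, with an illustrative Series `s`, shows both cases and the dict that corresponds to the Series:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['A', 'BB', 'CCC', 'DDDD'])

label_found = 'BB' in s              # True: 'BB' is an index label
value_as_label = 7 in s              # False: 7 is a value, not a label
value_found = 7 in s.to_numpy()      # True: search the values explicitly

print(label_found, value_as_label, value_found)

# to_dict makes the label-to-value mapping explicit.
as_dict = s.to_dict()
print(as_dict)  # {'A': 4, 'BB': 7, 'CCC': -5, 'DDDD': 3}
```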

We can create a Series from a Python dict by passing the dict:

In [28]:
# Named population rather than dict, so the built-in dict type is not shadowed
population = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [29]:
population

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [30]:
series_4 = pd.Series(population)
series_4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series holds the dict’s keys. You can override this by passing an index that lists the keys in the order you want them to appear. A label that is missing from the dict ('California') gets the value NaN, and a dict key that is not listed in the index ('Utah') is dropped:

In [31]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
series_5 = pd.Series(population, index=states)
series_5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
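NaN marks missing data, and its presence is also why the dtype changed from int64 to float64. A hedged sketch of how missing entries can be detected and dropped with `isna` and `dropna`; the data is rebuilt here under illustrative names so the block runs on its own:

```python
import pandas as pd

# Illustrative names; the data matches the example above.
population = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
s = pd.Series(population, index=states)

# isna flags the missing entries; notna is its complement.
missing = s.isna()
print(missing.to_dict())
# {'California': True, 'Ohio': False, 'Oregon': False, 'Texas': False}

# dropna returns a new Series without the missing entries.
known = s.dropna()
print(known.index.tolist())  # ['Ohio', 'Oregon', 'Texas']
```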

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [32]:
series_6 = series_4 + series_5
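Alignment means that values are matched by label, not by position, and that a label present in only one operand yields NaN in the result. The sketch below uses two small illustrative Series, `a` and `b`, and also shows the `add` method with `fill_value`, which treats a label missing from one side as 0 instead:

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series({'Ohio': 35000.0, 'Oregon': 16000.0, 'Texas': 71000.0})

# Values are paired by label even though the two orders differ.
total = a + b
print(total['Texas'])          # 142000.0
print(pd.isna(total['Utah']))  # True: 'Utah' exists only in a

# With fill_value=0, a label missing from one operand counts as 0.
filled = a.add(b, fill_value=0)
print(filled['Utah'])          # 5000.0
```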

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [33]:
# The name is shown in the last line of the Series' printed output
series_6.name = 'population'

In [34]:
series_6.index.name = 'state'

In [35]:
series_6

state
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
Name: population, dtype: float64
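One place where the two names matter is the conversion to a DataFrame: `reset_index` turns the index into a regular column and uses both names as column labels. A minimal sketch with an illustrative Series `s`:

```python
import pandas as pd

s = pd.Series({'Ohio': 70000.0, 'Oregon': 32000.0, 'Texas': 142000.0})
s.name = 'population'
s.index.name = 'state'

# The index name and the Series name become the column names.
frame = s.reset_index()
print(frame.columns.tolist())  # ['state', 'population']
print(frame.shape)             # (3, 2)
```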