![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39118512-dfa1cc1a-46e9-11e8-9547-093d4532451e.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Intro to Pandas Series

A Series is a one-dimensional array-like object containing a _typed_ sequence of values and an associated array of data labels, called its _index_.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on!


In [1]:
import numpy as np
import pandas as pd

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Series creation

The `pd.Series` constructor accepts, among others, the following parameters:

- **data**: the values we want to store in the Series. It can be a scalar value, a Python sequence, a dictionary or a one-dimensional NumPy ndarray.
- **index**: (optional) the labels we want to assign to our data values. It can be a Python sequence or a one-dimensional NumPy ndarray. Default value: a `RangeIndex` going from `0` to `len(data) - 1`.
- **dtype**: (optional) any NumPy data type.
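
As a quick sketch of the parameters above (the variable names `scalar_series` and `default_series` are just illustrative), a scalar passed as `data` is repeated once per label in `index`, and leaving `index` out produces the default integer index:

```python
import pandas as pd

# A scalar value is repeated for every label of the given index
scalar_series = pd.Series(7, index=['x', 'y', 'z'])

# Without an explicit index, pandas builds a RangeIndex starting at 0
default_series = pd.Series([10, 20, 30])

print(scalar_series.tolist())   # [7, 7, 7]
print(default_series.index)     # RangeIndex(start=0, stop=3, step=1)
```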

In [2]:
series = pd.Series([1, 2, 3, 4, 5])
series

0    1
1    2
2    3
3    4
4    5
dtype: int64

Series have an associated type:

In [3]:
# Show first values of our Series
series.head()

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]:
series.dtype

dtype('int64')

In [5]:
# np.float was removed from recent NumPy versions; np.float64 is the explicit equivalent
series = pd.Series([1, 2, 3, 4, 5], dtype=np.float64)
series

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [6]:
series.dtype

dtype('float64')
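
The type does not always have to be passed explicitly. The following sketch (with made-up variable names) shows how pandas infers a type from the data, and how an existing Series can be converted with `astype()`. The exact integer type inferred for `ints` may depend on the platform, so it is only printed here:

```python
import numpy as np
import pandas as pd

# Only integers: pandas infers an integer dtype
ints = pd.Series([1, 2, 3])

# Mixing integers and floats: everything is stored as float64
mixed = pd.Series([1, 2.5, 3])

# astype() returns a new Series converted to the requested dtype
as_float = ints.astype(np.float64)

print(ints.dtype, mixed.dtype, as_float.dtype)
```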

In [7]:
series = pd.Series(['a', 'b', 'c', 'd', 'e'])
series

0    a
1    b
2    c
3    d
4    e
dtype: object

In [8]:
# Using a NumPy ndarray
array = np.array([2, 4, 6, 8, 10])
series = pd.Series(array)
series

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [9]:
# With predefined index
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [10]:
# Using a dictionary (index will be defined using keys)
series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}, dtype=np.float64)
series

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Series attributes

These are the most common attributes to get information about a `Series`:

In [11]:
series = pd.Series(data=[1, 2, 3, 4, 5],
                   index=['a', 'b', 'c', 'd', 'e'],
                   dtype=np.float64)
series

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64

In [12]:
# Type of our Series
series.dtype

dtype('float64')

In [13]:
# Values of a series
series.values

array([1., 2., 3., 4., 5.])

In [14]:
type(series.values)

numpy.ndarray
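
The pandas documentation recommends `to_numpy()` over `.values` when a NumPy array is needed. A minimal sketch, rebuilding the same `series` as above (`arr` is just an illustrative name):

```python
import numpy as np
import pandas as pd

series = pd.Series(data=[1, 2, 3, 4, 5],
                   index=['a', 'b', 'c', 'd', 'e'],
                   dtype=np.float64)

# to_numpy() returns the values as a NumPy ndarray, without the index
arr = series.to_numpy()

print(type(arr))
print(arr)
```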

In [15]:
# Index of a series
series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [16]:
# Dimension of the Series
series.ndim

1

In [17]:
# Shape of the Series
series.shape

(5,)

In [18]:
# Number of Series elements
series.size

5

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## The Group of Seven

We'll start by analyzing "[The Group of Seven](https://en.wikipedia.org/wiki/Group_of_Seven)", a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population, and for that, we'll use a `pandas.Series` object.

In [19]:
# In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Someone might not know we're representing population in millions of inhabitants. Series can have a `name`, to better document the purpose of the Series:

In [20]:
g7_pop.name = 'G7 Population in millions'

g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

Series are pretty similar to NumPy arrays:

In [21]:
g7_pop.dtype

dtype('float64')

In [22]:
type(g7_pop.values)

numpy.ndarray

In [23]:
g7_pop.ndim

1

In [24]:
g7_pop.shape

(7,)

In [25]:
g7_pop.size

7

And they _look_ like simple Python lists or NumPy arrays. But they're actually more similar to Python `dict`s: every value is associated with a label stored in the index.

In [26]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [27]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

### Assigning `Series` indexes

In contrast to lists, we can explicitly define the index:

In [28]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [29]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Compare it with the [following table](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing): 

![image](https://user-images.githubusercontent.com/872296/38149656-b5ce9816-3431-11e8-88e4-195756e25355.png)

### Removing indexes

We can also remove the current index labels from our `Series`, going back to the default integer index. To do that we use the `reset_index()` method with the `drop=True` parameter:

In [30]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [31]:
g7_pop.reset_index(drop=True)

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [32]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Note that `reset_index()` returns a new `Series`, so if we want to keep the result we need to assign it to a variable, or use the `inplace=True` parameter to modify the original `Series`.

In [33]:
g7_pop.reset_index(drop=True, inplace=True)

In [34]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64
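
As an alternative to `inplace=True`, the result can simply be assigned to a variable. The sketch below uses a shortened, hypothetical Series called `pop`. It also shows what happens when `drop=True` is left out: the old labels are kept as a column and the result is a `DataFrame` instead of a `Series`.

```python
import pandas as pd

pop = pd.Series([35.467, 63.951],
                index=['Canada', 'France'],
                name='pop')

# Assigning the result keeps the original Series untouched
reset = pop.reset_index(drop=True)

# Without drop=True the old labels become a column of a DataFrame
with_labels = pop.reset_index()

print(list(pop.index))            # ['Canada', 'France']
print(list(reset.index))          # [0, 1]
print(list(with_labels.columns))  # ['index', 'pop']
```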

### Creating a `Series` with indexes already

We can create a new `Series` together with its index labels in a single step:

In [35]:
values = [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523]
indexes = ['Canada', 'France', 'Germany', 'Italy',
           'Japan', 'United Kingdom', 'United States']

pd.Series(values,
          index=indexes,
          name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

### Creating a `Series` from a data dictionary
We can say that Series look like "ordered dictionaries". We can actually create Series out of dictionaries:

In [36]:
data_dic = {
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}

g7_pop = pd.Series(data_dic,
                   name='G7 Population in millions')

In [37]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
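
The dictionary analogy also works in the other direction. As a small sketch (using a shortened, hypothetical `pop` Series), `keys()`, `items()` and `to_dict()` behave much like their `dict` counterparts:

```python
import pandas as pd

data = {'Canada': 35.467, 'France': 63.951, 'Germany': 80.94}
pop = pd.Series(data)

# keys() returns the index labels
labels = list(pop.keys())

# items() iterates over (label, value) pairs, like dict.items()
pairs = list(pop.items())

# to_dict() converts the Series back into a plain dictionary
as_dict = pop.to_dict()

print(labels)     # ['Canada', 'France', 'Germany']
print(pairs[0])   # ('Canada', 35.467)
```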

### Creating a `Series` out of other `Series`

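You can also create a Series out of another Series, specifying the index labels to keep. Labels that don't exist in the source Series (like `'Spain'` in the following example) get a `NaN` value: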

In [38]:
pd.Series(g7_pop,
          index=['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
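
Passing a Series plus an `index` gives the same result as calling `reindex()` on the source Series. A minimal sketch, assuming a shortened source Series named `source` (a name made up for this example), which also lists the labels that ended up with `NaN`:

```python
import pandas as pd

source = pd.Series({'France': 63.951, 'Germany': 80.94, 'Italy': 60.665})
labels = ['France', 'Germany', 'Italy', 'Spain']

# Both forms align the data to the requested labels
subset = pd.Series(source, index=labels)
reindexed = source.reindex(labels)

# Labels that were not in the source hold NaN
missing = subset[subset.isna()].index.tolist()

print(subset.equals(reindexed))  # True
print(missing)                   # ['Spain']
```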

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)


# Pandas Series - Selection and Indexing

A pandas Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Keeping these two overlapping analogies in mind will help us understand the patterns of data indexing and selection in this data structure.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on!

In [1]:
import pandas as pd
import numpy as np

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

The first thing we'll do is recreate the `Series` from our previous lecture:

In [2]:
data_dic = {
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}

g7_pop = pd.Series(data_dic,
                   name='G7 Population in millions')

In [3]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Indexing

Indexing works similarly to lists and dictionaries.

### Indexing by index

You use the **index** label of the element you're looking for:

In [4]:
g7_pop['Canada']

35.467

In [5]:
g7_pop['Japan']

127.061

In [6]:
g7_pop['United Kingdom']

64.511

The following also works, but it's **NOT** recommended:

In [7]:
g7_pop.Japan

127.061
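
Here is a short sketch of why attribute access is discouraged. It assumes a made-up Series `s` in which one label happens to be `'size'`, which is also the name of a `Series` attribute. Labels containing spaces, such as `'United Kingdom'`, can't be written as attributes at all.

```python
import pandas as pd

s = pd.Series({'Japan': 127.061, 'size': 10.0, 'United Kingdom': 64.511})

# Attribute access works for simple labels...
print(s.Japan)      # 127.061

# ...but real Series attributes take precedence over labels
print(s.size)       # 3 (number of elements, not the value stored under 'size')
print(s['size'])    # 10.0
```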

### Slicing and multi-selection

Slicing also works, but there's an **important** difference with Python lists: when slicing by index labels, the upper limit is also included:

In [8]:
g7_pop['Germany': 'Japan']

Germany     80.940
Italy       60.665
Japan      127.061
Name: G7 Population in millions, dtype: float64

Selecting several elements at once with a list of labels also works (similarly to NumPy):

In [9]:
g7_pop[['Italy', 'France', 'United States']]

Italy             60.665
France            63.951
United States    318.523
Name: G7 Population in millions, dtype: float64

### Indexing by sequential position

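Indexing elements by their sequential position also works, using `iloc`. Plain `[]` indexing first looks for the received object among the index labels and only falls back to the sequential position if it isn't found there; recent pandas versions deprecate that fallback, so `iloc` is the explicit and safer choice.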

With sequential position the upper limit is not included.

In [10]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [11]:
g7_pop.iloc[0] # First element

35.467

In [12]:
g7_pop.iloc[-1] # Last element

318.523

Other examples:

In [13]:
g7_pop.iloc[2]

80.94

In [14]:
g7_pop.iloc[4]

127.061

In [15]:
g7_pop.iloc[2:4]

Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64

In [16]:
g7_pop.iloc[[3, 1, 6]]

Italy             60.665
France            63.951
United States    318.523
Name: G7 Population in millions, dtype: float64
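
To make the difference between both kinds of slicing concrete, here's a small side-by-side sketch on a shortened, hypothetical `pop` Series:

```python
import pandas as pd

pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
})

# Slicing by label: 'Japan' (the upper limit) is included
by_label = pop['Germany':'Japan']

# Slicing by position: position 4 (the upper limit) is excluded
by_position = pop.iloc[2:4]

print(list(by_label.index))     # ['Germany', 'Italy', 'Japan']
print(list(by_position.index))  # ['Germany', 'Italy']
```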

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Adding new elements to a `Series`

In many cases we'll want to add new values to our `Series`. To do that, we can simply index our `Series` using the new label and assign a value to it. Let's add two new records:

In [17]:
g7_pop['Brazil'] = 20.124
g7_pop['India'] = 32.235

In [18]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Brazil             20.124
India              32.235
Name: G7 Population in millions, dtype: float64
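
When several records come from another Series, `pd.concat()` is one possible alternative to assigning them one by one. This is only a sketch with shortened, made-up data. Unlike the assignment above, it returns a new `Series` and leaves the original untouched:

```python
import pandas as pd

pop = pd.Series({'Canada': 35.467, 'France': 63.951})
extra = pd.Series({'Brazil': 20.124, 'India': 32.235})

# concat() appends the elements of extra after those of pop
combined = pd.concat([pop, extra])

print(list(combined.index))  # ['Canada', 'France', 'Brazil', 'India']
print(len(pop))              # 2
```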

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Modifying `Series` elements

In [19]:
g7_pop['Canada'] = 40.5

g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Brazil             20.124
India              32.235
Name: G7 Population in millions, dtype: float64

In [20]:
g7_pop['France'] = np.nan

g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Brazil             20.124
India              32.235
Name: G7 Population in millions, dtype: float64

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Removing elements from a `Series`

In [21]:
del g7_pop['Brazil']

g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
India              32.235
Name: G7 Population in millions, dtype: float64

In [22]:
del g7_pop['India']

g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
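
`del` modifies the `Series` in place. If we'd rather keep the original, the `drop()` method returns a new `Series` without the given labels. A minimal sketch with a shortened, hypothetical `pop` Series:

```python
import pandas as pd

pop = pd.Series({'Canada': 40.5, 'Brazil': 20.124, 'India': 32.235})

# drop() accepts a single label or a list of labels
trimmed = pop.drop(['Brazil', 'India'])

print(list(trimmed.index))  # ['Canada']
print('Brazil' in pop)      # True, the original is unchanged
```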

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

### Checking existence of a key (membership)

In [23]:
'France' in g7_pop

True

In [24]:
'Brazil' in g7_pop

False
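
Just like with dictionaries, `in` checks the index labels, not the values. To look for a value we can test against `.values` or use `isin()`, as in this sketch (the `pop` Series is shortened for the example):

```python
import pandas as pd

pop = pd.Series({'Canada': 35.467, 'France': 63.951})

# `in` looks at the index labels only
label_found = 'France' in pop

# The value 63.951 is not a label, so this is False
value_as_label = 63.951 in pop

# Two ways of checking the values instead
value_found = bool(63.951 in pop.values)
value_found_isin = bool(pop.isin([63.951]).any())

print(label_found, value_as_label, value_found, value_found_isin)
```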

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Introducing **`loc`** & **`iloc`**

What's the problem with the indexing we've seen? It's not explicit. Pandas receives an element to index and it tries figuring out if we meant to select an element by its key, or its sequential position. Check out the following example:

In [25]:
s = pd.Series(
    ['a', 'b', 'c'],
    index=[1, 2, 3])
s

1    a
2    b
3    c
dtype: object

In [26]:
s

1    a
2    b
3    c
dtype: object

What happens if we try indexing `s[1]`, what should it return? `a` or `b`?

In [27]:
s[1]

'a'

In this case, the element is looked up by its index label, not by its sequential position. But again, it's not intuitive or explicit.

Enter `loc` and `iloc`:
* `loc` is the preferred way to select elements in Series (and DataFrames) by their index label
* `iloc` is the preferred way to select by sequential position

In [28]:
s.loc[1]

'a'

In [29]:
s.iloc[1]

'b'
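
The explicit behavior also shows up when something doesn't exist. In this sketch, built on the same `s`, `loc` raises a `KeyError` for a missing label and `iloc` raises an `IndexError` for a position out of range (the `errors` list is only there to collect the results):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])

# 3 is a valid label, so loc finds it
print(s.loc[3])  # 'c'

errors = []

# ...but there is no position 3 in a Series of three elements
try:
    s.iloc[3]
except IndexError:
    errors.append('IndexError')

# 0 is a valid position, but not a valid label
try:
    s.loc[0]
except KeyError:
    errors.append('KeyError')

print(errors)  # ['IndexError', 'KeyError']
```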

In [30]:
g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [31]:
g7_pop.iloc[-1]

318.523

In [32]:
g7_pop.iloc[[0, 1]]

Canada    40.5
France     NaN
Name: G7 Population in millions, dtype: float64

Using our previous series:

In [33]:
g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [34]:
g7_pop.loc['Japan']

127.061

In [35]:
g7_pop.iloc[-1]

318.523

In [36]:
g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [37]:
g7_pop.loc['Canada']

40.5

In [38]:
g7_pop.iloc[0]

40.5

In [39]:
g7_pop.iloc[-1]

318.523

In [40]:
g7_pop.loc[['Japan', 'Canada']]

Japan     127.061
Canada     40.500
Name: G7 Population in millions, dtype: float64

In [41]:
g7_pop.iloc[[0, -1]]

Canada            40.500
United States    318.523
Name: G7 Population in millions, dtype: float64

#### **`loc`** & **`iloc`** to modify `Series`

In [42]:
g7_pop.loc['United States'] = 1000

g7_pop

Canada              40.500
France                 NaN
Germany             80.940
Italy               60.665
Japan              127.061
United Kingdom      64.511
United States     1000.000
Name: G7 Population in millions, dtype: float64

In [43]:
g7_pop.iloc[-1] = 500

g7_pop

Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Introduction to Boolean arrays

Another way to select certain values within a `Series` is using **boolean arrays**, also known as **Conditional selection**.

We can index our `Series` using a list of boolean values:

In [44]:
g7_pop[[False, False,  True, False,  True, False,  True]]

Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64

Or we can index our `Series` using another `Series` with boolean values:

In [45]:
condition = pd.Series([
    False, False,  True, False,  True, False,  True
], index=[
    'Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'
])

condition

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
dtype: bool

In [46]:
g7_pop[condition]

Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64
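
In practice we rarely write the boolean values by hand. As a small preview, and only as a sketch with shortened data, comparing a `Series` against a value produces a boolean `Series` that can be used as the condition (the threshold of 70 is arbitrary):

```python
import pandas as pd

pop = pd.Series({
    'Canada': 40.5,
    'Germany': 80.94,
    'Japan': 127.061,
    'United States': 500.0,
})

# The comparison is applied element by element and keeps the index
condition = pop > 70

# Only the elements where the condition is True are selected
selected = pop[condition]

print(condition.tolist())    # [False, True, True, True]
print(list(selected.index))  # ['Germany', 'Japan', 'United States']
```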

> In upcoming lectures we'll see how to use more complex **conditional selections**.

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)