![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Pandas - Series


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Hands on! - The Group Of Seven example

We'll start by analyzing "[The Group of Seven](https://en.wikipedia.org/wiki/Group_of_Seven)", a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population data, and for that we'll use a `pandas.Series` object.

<img width="350" src="https://user-images.githubusercontent.com/872296/38149656-b5ce9816-3431-11e8-88e4-195756e25355.png" />

^^^ We will build, step by step, the Series that represents the table above.

In [3]:
import pandas as pd
import numpy as np

In [5]:
# recall how a NumPy array is created

lst = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]  # In millions
arr = np.array(lst)
arr

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [8]:
# creating a Series is quite similar

lst = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]  # In millions
g7_pop = pd.Series(lst)
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Note that the Series above has a default integer `RangeIndex`. We will assign a string index soon.

Also note that the Series does not have a name yet, so a reader might not know that we're representing population in millions of inhabitants. A Series can have a `name` to better document its purpose:

In [10]:
g7_pop.name = "G7 Population in millions"
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

# Series are pretty similar to numpy arrays:

They're actually backed by numpy arrays:

In [12]:
g7_pop.dtype

dtype('float64')

In [14]:
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [15]:
type(g7_pop.values)

numpy.ndarray

In [17]:
g7_pop.values.mean(), g7_pop.values.sum(), g7_pop.values.max()

(107.30257142857144, 751.118, 318.523)
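The same aggregations are also available directly on the Series, so going through `.values` is optional. A minimal sketch, rebuilding the same data under the illustrative name `pop` (not a name used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

# Same population figures as above (in millions); `pop` is an illustrative name.
pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

# Aggregations called on the Series agree with those on the underlying array.
print(pop.mean(), pop.sum(), pop.max())
print(np.isclose(pop.mean(), pop.values.mean()))

# to_numpy() is another way to get the data as a NumPy array.
print(type(pop.to_numpy()))
```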

# Index

Indexing works much as it does for lists and dictionaries; you use the **index** of the element you're looking for:

In [23]:
g7_pop  # recall

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

## numerical index

So far, the Series has an integer range index:

In [19]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In [20]:
g7_pop[4], g7_pop[1], g7_pop[0], g7_pop[2]

(127.061, 63.951, 35.467, 80.94)

In [22]:
g7_pop[1:-1]

1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
Name: G7 Population in millions, dtype: float64
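One difference from lists is worth keeping in mind: with an integer index, a single integer inside `[]` is looked up as a label rather than a position, so a negative number does not count from the end. A small sketch, using the illustrative name `pop` for a fresh copy of the data:

```python
import pandas as pd

# Default RangeIndex with labels 0..6; `pop` is an illustrative name.
pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

try:
    pop[-1]  # there is no label -1 in the index
except KeyError as err:
    print("KeyError:", err)

# Positional access from the end goes through iloc instead.
last = pop.iloc[-1]
print(last)
```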

## string Index

Series _look_ like simple Python lists or NumPy arrays, but they're actually more similar to Python `dict`s.
A Series has an `index` that is similar to the automatic index assigned to Python's lists.
But, in contrast to lists, we can explicitly define the index:


In [25]:
g7_pop.index = [
    "Canada",
    "France",
    "Germany",
    "Italy",
    "Japan",
    "United Kingdom",
    "United States",
]  # String index possible, which is really similar to python dict - keys

g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [26]:
g7_pop[4], g7_pop["Japan"]  # both position and label work here (recent pandas versions deprecate g7_pop[4] in favor of g7_pop.iloc[4])

(127.061, 127.061)

In [27]:
g7_pop[["Germany", "France", "Italy"]]

Germany    80.940
France     63.951
Italy      60.665
Name: G7 Population in millions, dtype: float64
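The dictionary comparison goes a bit further than `[]` lookups. The sketch below shows a few dict-style operations on a shortened, illustrative Series named `pop` (three countries only, to keep the output brief):

```python
import pandas as pd

# Shortened, illustrative version of the G7 data.
pop = pd.Series(
    {"Canada": 35.467, "France": 63.951, "Germany": 80.940},
    name="G7 Population in millions",
)

print("France" in pop)   # membership tests check the index labels
print("Spain" in pop)
print(list(pop.keys()))  # keys() returns the index

for country, value in pop.items():  # iterate over (label, value) pairs
    print(country, value)

# get() returns a default instead of raising KeyError for a missing label.
print(pop.get("Spain", 0.0))
```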

### key error case or not

In [33]:
g7_pop[["Spain", "Germany", "France", "Italy"]]  # error. We expect key error right?

KeyError: "['Spain'] not in index"

However, the cell below does not raise an error even though it also asks for the `Spain` label. Why?

In [34]:
pd.Series(g7_pop, index=["Spain", "Germany", "France", "Italy"])  # this is re-creation

Spain         NaN
Germany    80.940
France     63.951
Italy      60.665
Name: G7 Population in millions, dtype: float64
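The difference is that passing an existing Series together with `index=` does not look labels up one by one. It builds a new Series aligned to the requested labels, and labels that are missing get `NaN`. The sketch below compares this with `reindex`, which should give an equivalent result; the name `pop` and the four-country subset are illustrative:

```python
import pandas as pd

# Illustrative subset of the G7 data.
pop = pd.Series(
    {"Canada": 35.467, "France": 63.951, "Germany": 80.940, "Italy": 60.665}
)
labels = ["Spain", "Germany", "France", "Italy"]

rebuilt = pd.Series(pop, index=labels)  # the construction used above
reindexed = pop.reindex(labels)         # explicit reindexing

print(reindexed)
print(rebuilt.equals(reindexed))

# fill_value replaces the NaN placeholder for missing labels.
print(pop.reindex(labels, fill_value=0.0))
```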

## loc

In [None]:
g7_pop.loc["Canada"], g7_pop["Canada"]

(35.467, 35.467)

In [35]:
g7_pop.loc["Italy"], g7_pop["Italy"]

(60.665, 60.665)

As we will learn later, `loc` can also be used for conditional indexing.
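As a brief preview, `loc` accepts a boolean mask or a list of labels as well as a single label. A minimal sketch, with `pop` as an illustrative name for a fresh copy of the data:

```python
import pandas as pd

# Fresh copy of the G7 data; `pop` is an illustrative name.
pop = pd.Series(
    {
        "Canada": 35.467,
        "France": 63.951,
        "Germany": 80.940,
        "Italy": 60.665,
        "Japan": 127.061,
        "United Kingdom": 64.511,
        "United States": 318.523,
    },
    name="G7 Population in millions",
)

big = pop.loc[pop > 70]                # boolean mask: keep rows where the condition holds
picked = pop.loc[["Japan", "Canada"]]  # list of labels, returned in the order given

print(big)
print(picked)
```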

## iloc

Numeric positions can also be used, with the `iloc` attribute:

In [38]:
g7_pop.iloc[2]  # only integer possible

80.94

In [40]:
g7_pop.iloc["Italy"]  # error: raises TypeError, not KeyError

TypeError: Cannot index by location index with a non-integer key

In [44]:
g7_pop.iloc[[0, 1, 1, 2, 2, 2, 1]]

Canada     35.467
France     63.951
France     63.951
Germany    80.940
Germany    80.940
Germany    80.940
France     63.951
Name: G7 Population in millions, dtype: float64

In [45]:
g7_pop[1:4]  # slicing by integer position

France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64

In [48]:
g7_pop.iloc[1:4]  # slicing by integer position

France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64

Slicing by label also works, but **important**: unlike positional slicing, in Pandas a label-based slice also includes the upper limit.
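The difference between the two slicing styles can be sketched as follows, assuming the same G7 data under the illustrative name `pop`:

```python
import pandas as pd

# Fresh copy of the G7 data; `pop` is an illustrative name.
pop = pd.Series(
    {
        "Canada": 35.467,
        "France": 63.951,
        "Germany": 80.940,
        "Italy": 60.665,
        "Japan": 127.061,
        "United Kingdom": 64.511,
        "United States": 318.523,
    }
)

by_position = pop.iloc[1:3]            # stops before position 3
by_label = pop.loc["France":"Italy"]   # includes the "Italy" label

print(by_position)
print(by_label)
```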

### more indexing examples

Try the expressions below and quiz yourself:

```python
g7_pop[:5]
g7_pop[4:6]
g7_pop["Canada":"Japan"]
g7_pop["Japan":"Canada"]
```


---
## Conditional selection (boolean arrays)

The same boolean array techniques we saw applied to numpy arrays can be used for Pandas `Series`:

In [None]:
g7_pop  # recall (Canada and United States reflect the updates made in "Operations and methods" below)

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop[g7_pop > 70]

Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop[g7_pop > 70].mean()

236.00033333333332

In [None]:
g7_pop.mean()

133.94685714285714

In [None]:
g7_pop[g7_pop > g7_pop.mean()]

United States    500.0
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop.std(), g7_pop.mean(), g7_pop.max(), g7_pop.min()

(163.6435910237332, 133.94685714285714, 500.0, 40.5)

In [None]:
g7_pop.describe()  # summary of statistics

count      7.000000
mean     133.946857
std      163.643591
min       40.500000
25%       62.308000
50%       64.511000
75%      104.000500
max      500.000000
Name: G7 Population in millions, dtype: float64
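The percentile rows reported by `describe()` can also be computed individually. A small sketch using the same seven values, with `pop` as an illustrative name:

```python
import pandas as pd

# Same values as the Series summarized above; `pop` is an illustrative name.
pop = pd.Series([40.5, 63.951, 80.940, 60.665, 127.061, 64.511, 500.0])

summary = pop.describe()

# The "50%" row is the median; the other rows are quantiles.
print(summary["50%"], pop.median())
print(summary["25%"], pop.quantile(0.25))
print(summary["75%"], pop.quantile(0.75))
```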

#### Multiple conditions using OR and AND
- `|` or
- `&` and

In [None]:
g7_pop[
    (g7_pop > g7_pop.mean() - g7_pop.std() / 2)
    | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)
]

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop[(g7_pop > 80) | (g7_pop < 40)]

Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]

Germany     80.940
Japan      127.061
Name: G7 Population in millions, dtype: float64
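Python's `and` and `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used, with each condition wrapped in parentheses. A minimal sketch of what happens otherwise, on a shortened, illustrative Series named `pop`:

```python
import pandas as pd

# Shortened, illustrative version of the data.
pop = pd.Series(
    {"Canada": 40.5, "Germany": 80.940, "Japan": 127.061, "United States": 500.0}
)

try:
    pop[(pop > 80) and (pop < 200)]  # `and` asks for one truth value for the whole Series
except ValueError as err:
    print("ValueError:", err)

# The element-wise operator combines the two masks row by row.
mid = pop[(pop > 80) & (pop < 200)]
print(mid)
```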

#### Using NOT
- `~` not

In [None]:
g7_pop[(g7_pop > 80)]

Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop[~(g7_pop > 80)]

Canada            40.500
France            63.951
Italy             60.665
United Kingdom    64.511
Name: G7 Population in millions, dtype: float64

---
## Operations and methods (modifying series)
Series also support vectorized operations and aggregation functions, just like NumPy arrays:

In [None]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [None]:
g7_pop * 1_000_000  # broadcasting

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64

In [None]:
np.log(g7_pop)  # NumPy functions are applied element-wise

Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 Population in millions, dtype: float64

### when the update succeeds

In [88]:
g7_pop["Canada"] = 40.5
g7_pop  # check if Canada has 40.5

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [89]:
g7_pop.iloc[-1] = 500
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

### when the update does not take effect

In [51]:
g7_pop["Canada"] * 100

4050.0

In [52]:
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [53]:
g7_pop["Canada"] = g7_pop["Canada"] * 100  # this one assigns with `=`, so the Series is updated
g7_pop

Canada            4050.000
France              63.951
Germany             80.940
Italy               60.665
Japan              127.061
United Kingdom      64.511
United States      500.000
Name: G7 Population in millions, dtype: float64

In [54]:
g7_pop["Canada"] = 40.5  # roll back
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64
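The same rule applies to whole-Series operations: an expression returns a new object, and only an assignment changes stored values. The sketch below also assigns through a boolean mask, working on a copy so the original stays intact. The names `pop`, `scaled` and `capped`, and the value `99.99`, are illustrative:

```python
import pandas as pd

# Shortened, illustrative version of the data.
pop = pd.Series(
    {"Canada": 40.5, "France": 63.951, "Germany": 80.940, "Japan": 127.061}
)

scaled = pop * 100  # returns a new Series; `pop` itself is unchanged
print(pop["Canada"], scaled["Canada"])

capped = pop.copy()                # work on a copy to keep `pop` intact
capped.loc[capped > 70] = 99.99    # assign to every element matching the mask
print(capped)
```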

---

## Creating the complete Series in one code cell

Compare it with the [following table](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing): 

<img width="350" src="https://user-images.githubusercontent.com/872296/38149656-b5ce9816-3431-11e8-88e4-195756e25355.png" />


In [None]:
pd.Series(
    data=[35, 63, 80, 60, 127, 64, 318],
    index=[
        "Canada",
        "France",
        "Germany",
        "Italy",
        "Japan",
        "United Kingdom",
        "United States",
    ],
)

Canada             35
France             63
Germany            80
Italy              60
Japan             127
United Kingdom     64
United States     318
dtype: int64

In [92]:
pd.Series(
    data=[35, 63, 80, 60, 127, 64, 318],
    index=[
        "Canada",
        "France",
        "Germany",
        "Italy",
        "Japan",
        "United Kingdom",
        "United States",
    ],
    name="G7 Population in millions",
    dtype=float,
)

Canada             35.0
France             63.0
Germany            80.0
Italy              60.0
Japan             127.0
United Kingdom     64.0
United States     318.0
Name: G7 Population in millions, dtype: float64

In [90]:
# creating from a dictionary

pd.Series(
    {
        "Canada": 35.467,
        "France": 63.951,
        "Germany": 80.94,
        "Italy": 60.665,
        "Japan": 127.061,
        "United Kingdom": 64.511,
        "United States": 318.523,
    },
    name="G7 Population in millions",
)

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
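The list-based and dictionary-based constructions should produce the same Series when given the same labels and values. A quick sketch to check this, where `countries`, `values`, `from_lists` and `from_dict` are illustrative names:

```python
import pandas as pd

countries = [
    "Canada", "France", "Germany", "Italy",
    "Japan", "United Kingdom", "United States",
]
values = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]

# Values plus an explicit index.
from_lists = pd.Series(values, index=countries, name="G7 Population in millions")

# A dictionary: keys become the index, in insertion order.
from_dict = pd.Series(dict(zip(countries, values)), name="G7 Population in millions")

print(from_lists.equals(from_dict))
print(from_dict.to_dict() == dict(zip(countries, values)))
```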