<a href="https://colab.research.google.com/github/hemu2014/python-data-test/blob/main/1%20-%20Pandas%20-%20Series%E2%80%94%E2%80%9433.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Pandas - Series


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on!

In [32]:
import pandas as pd
import numpy as np

## Pandas Series

We'll start by analyzing "[The Group of Seven](https://en.wikipedia.org/wiki/Group_of_Seven)", a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population, and for that, we'll use a `pandas.Series` object.

In [34]:
# In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

In [33]:
g7_pop

Unnamed: 0,0
0,35.467
1,63.951
2,80.94
3,60.665
4,127.061
5,64.511
6,318.523


Someone might not know we're representing population in millions of inhabitants. Series can have a `name`, to better document the purpose of the Series:

In [36]:
g7_pop.name = 'G7 Population in millions'

In [35]:
g7_pop

Unnamed: 0,G7 Population in millions
0,35.467
1,63.951
2,80.94
3,60.665
4,127.061
5,64.511
6,318.523


Series are pretty similar to numpy arrays:

In [37]:
g7_pop.dtype

dtype('float64')

In [38]:
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

They're actually backed by numpy arrays:

In [39]:
type(g7_pop.values)

numpy.ndarray
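As a small sketch of that relationship (the names `pop` and `arr` are illustrative and not part of the notebook), `Series.to_numpy()` is the accessor the pandas documentation recommends over `.values`. For a plain float Series like this one, both should give the same numpy array:

```python
import numpy as np
import pandas as pd

# Illustrative data: the first three G7 populations, in millions
pop = pd.Series([35.467, 63.951, 80.940])

# Explicit conversion of the Series data to a numpy array
arr = pop.to_numpy()

print(type(arr))                         # the numpy array type
print(arr.dtype)                         # the same dtype as the Series
print(np.array_equal(arr, pop.values))   # both accessors agree
```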

And they _look_ like simple Python lists or Numpy Arrays. But they're actually more similar to Python `dict`s.

A Series has an `index`, that's similar to the automatic index assigned to Python's lists:

In [40]:
g7_pop

Unnamed: 0,G7 Population in millions
0,35.467
1,63.951
2,80.94
3,60.665
4,127.061
5,64.511
6,318.523


In [41]:
g7_pop[0]

np.float64(35.467)

In [42]:
g7_pop[1]

np.float64(63.951)

In [43]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In [44]:
l = ['a', 'b', 'c']

But, in contrast to lists, we can explicitly define the index:

In [55]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [45]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


Compare it with the [following table](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing):

<img width="350" src="https://user-images.githubusercontent.com/872296/38149656-b5ce9816-3431-11e8-88e4-195756e25355.png" />

We can say that Series look like "ordered dictionaries". We can actually create Series out of dictionaries:

In [46]:
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [54]:
pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
    name='G7 Population in millions')

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523
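The two constructors above should produce equivalent objects. A minimal check, using a shortened three-country sample and illustrative variable names, might look like this:

```python
import pandas as pd

# Shortened sample; names and values mirror the first G7 entries
countries = ['Canada', 'France', 'Germany']
values = [35.467, 63.951, 80.94]

# Build the same Series two ways: from a dict, and from a list plus an index
from_dict = pd.Series(dict(zip(countries, values)))
from_list = pd.Series(values, index=countries)

# Series.equals compares both the index and the values
print(from_dict.equals(from_list))
```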


You can also create Series out of other series, specifying indexes:

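This does not raise an error: labels that are missing from the original Series (here `'Spain'`) are filled with `NaN`.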

In [48]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

Unnamed: 0,G7 Population in millions
France,63.951
Germany,80.94
Italy,60.665
Spain,NaN
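Passing an existing Series together with an `index` behaves like a reindexing operation. The sketch below, on a shortened illustrative Series named `pop`, compares the constructor form with `Series.reindex` and shows that a label absent from the source comes back as `NaN`:

```python
import pandas as pd

# Illustrative three-country Series
pop = pd.Series({'France': 63.951, 'Germany': 80.94, 'Italy': 60.665})
labels = ['France', 'Italy', 'Spain']  # 'Spain' is not in pop

# Constructor form, as used in the notebook
via_constructor = pd.Series(pop, index=labels)

# Equivalent explicit form
via_reindex = pop.reindex(labels)

print(via_constructor.equals(via_reindex))
print(via_reindex.isna())  # marks which labels had no data
```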


Accessing a label that does not exist in the index with `.loc` raises an error:

In [49]:
g7_pop.loc[['France', 'Germany', 'Italy', 'Spain']]

KeyError: "['Spain'] not in index"
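When some labels may be missing, there are a few ways to avoid the `KeyError`. The following is a sketch on an illustrative Series (`pop`, `wanted` and the other names are assumptions for the example), not the only way to handle it:

```python
import pandas as pd

pop = pd.Series({'France': 63.951, 'Germany': 80.94, 'Italy': 60.665})
wanted = ['France', 'Italy', 'Spain']  # 'Spain' is not in the index

# Option 1: catch the error raised by .loc for unknown labels
try:
    pop.loc[wanted]
    raised = False
except KeyError:
    raised = True

# Option 2: keep only the labels that actually exist in the index
present = pop.index.intersection(wanted)
subset = pop.loc[present]

# Option 3: Series.get returns a default instead of raising
spain = pop.get('Spain', default=float('nan'))

print(raised, list(subset.index), spain)
```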

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Indexing

Indexing works similarly to lists and dictionaries: you use the **index** of the element you're looking for.

In [59]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [58]:
g7_pop['Canada']

np.float64(35.467)

In [56]:
g7_pop['Japan']

np.float64(127.061)

Numeric positions can also be used, with the `iloc` attribute:

In [57]:
g7_pop.iloc[0]

np.float64(35.467)

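`.loc`, by contrast, looks up index labels rather than positions. The Series index has been changed to country names (strings), so `g7_pop.loc[0]` searches for the label `0`, which does not exist, and raises a `KeyError`. The current index labels are `'Canada'`, `'France'`, and so on: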

In [62]:
g7_pop.loc[0]

KeyError: 0

In [60]:
g7_pop.iloc[-1]

np.float64(318.523)
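The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the labels are integers that do not match the positions. A small illustrative example, with a made-up Series `s`:

```python
import pandas as pd

# Integer labels deliberately in reverse order of position
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0 (the last one)
by_position = s.iloc[0]  # the element at position 0 (the first one)

print(by_label, by_position)
```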

Selecting multiple elements at once:

In [61]:
g7_pop[['Italy', 'France']]

Unnamed: 0,G7 Population in millions
Italy,60.665
France,63.951


_(The result is another Series)_

In [63]:
g7_pop.iloc[[0, 1]]

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951


Slicing also works, but **important**: when slicing by label, Pandas includes the upper limit:

In [64]:
g7_pop['Canada': 'Italy']

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
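The inclusive upper limit applies to label-based slices only. Positional slices through `iloc` follow the usual Python convention and exclude the stop position. A short sketch on an illustrative five-country Series:

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan'],
)

# Label slice: 'Italy' (the upper limit) is included
label_slice = pop['Canada':'Italy']

# Positional slice: position 3 (the stop) is excluded
position_slice = pop.iloc[0:3]

print(len(label_slice), len(position_slice))
```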


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Conditional selection (boolean arrays)

The same boolean array techniques we saw applied to numpy arrays can be used for Pandas `Series`:

In [65]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [66]:
g7_pop > 70

Unnamed: 0,G7 Population in millions
Canada,False
France,False
Germany,True
Italy,False
Japan,True
United Kingdom,False
United States,True


This works just like boolean indexing on numpy arrays:

In [67]:
g7_pop[g7_pop > 70]

Unnamed: 0,G7 Population in millions
Germany,80.94
Japan,127.061
United States,318.523


In [68]:
g7_pop.mean()

np.float64(107.30257142857144)

In [69]:
g7_pop[g7_pop > g7_pop.mean()]

Unnamed: 0,G7 Population in millions
Japan,127.061
United States,318.523


In [70]:
g7_pop.std()

97.24996987121581

The boolean operators are `~` (not), `|` (or) and `&` (and).

In [73]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop < g7_pop.mean() + g7_pop.std() / 2)]

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523
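The `|` filter above keeps every row, because each value satisfies at least one of the two conditions. If the goal is to keep only the values within half a standard deviation of the mean (an assumption about the intent, not something the notebook states), `&` is the operator to use, and `~` inverts the selection. A sketch with illustrative names:

```python
import pandas as pd

pop = pd.Series({
    'Canada': 35.467, 'France': 63.951, 'Germany': 80.94,
    'Italy': 60.665, 'Japan': 127.061, 'United Kingdom': 64.511,
    'United States': 318.523,
})

lower = pop.mean() - pop.std() / 2
upper = pop.mean() + pop.std() / 2

# Each comparison needs its own parentheses: &, | and ~ bind tighter than > and <
either = pop[(pop > lower) | (pop < upper)]     # always true for every value
both = pop[(pop > lower) & (pop < upper)]       # inside the band
outside = pop[~((pop > lower) & (pop < upper))] # everything else

print(list(both.index))
print(list(outside.index))
```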


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Operations and methods
Series also support vectorized operations and aggregation functions, just like numpy arrays:

In [74]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [75]:
g7_pop * 1_000_000

Unnamed: 0,G7 Population in millions
Canada,35467000.0
France,63951000.0
Germany,80940000.0
Italy,60665000.0
Japan,127061000.0
United Kingdom,64511000.0
United States,318523000.0


In [76]:
g7_pop.mean()

np.float64(107.30257142857144)

Take the natural logarithm (base e). Note that the earlier operations did not modify the original Series:

In [77]:
np.log(g7_pop)

Unnamed: 0,G7 Population in millions
Canada,3.568603
France,4.158117
Germany,4.393708
Italy,4.105367
Japan,4.844667
United Kingdom,4.166836
United States,5.763695
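Vectorized operations return a new Series and leave the original untouched. To keep a result, assign it to a name. A minimal illustration with a two-element Series (all names here are made up for the example):

```python
import numpy as np
import pandas as pd

pop = pd.Series([35.467, 63.951], index=['Canada', 'France'])

# Both operations produce new Series objects
scaled = pop * 1_000_000
logged = np.log(pop)

# The original values are unchanged
print(pop.tolist())
```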


In [78]:
g7_pop['France': 'Italy'].mean()

np.float64(68.51866666666666)

In [79]:
g7_pop['France': 'Italy']

Unnamed: 0,G7 Population in millions
France,63.951
Germany,80.94
Italy,60.665


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Boolean arrays
(Work in the same way as numpy)

In [80]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [81]:
g7_pop > 80

Unnamed: 0,G7 Population in millions
Canada,False
France,False
Germany,True
Italy,False
Japan,True
United Kingdom,False
United States,True


In [82]:
g7_pop[g7_pop > 80]

Unnamed: 0,G7 Population in millions
Germany,80.94
Japan,127.061
United States,318.523


In [83]:
g7_pop[(g7_pop > 80) | (g7_pop < 40)]

Unnamed: 0,G7 Population in millions
Canada,35.467
Germany,80.94
Japan,127.061
United States,318.523


In [84]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]

Unnamed: 0,G7 Population in millions
Germany,80.94
Japan,127.061


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Modifying series


In [85]:
g7_pop['Canada'] = 40.5

In [86]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,40.5
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


In [87]:
g7_pop.iloc[-1] = 500

In [88]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,40.5
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,500.0


In [92]:
g7_pop["France"] = 64.124
g7_pop

Unnamed: 0,G7 Population in millions
Canada,40.5
France,64.124
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,500.0


In [93]:
g7_pop[g7_pop < 70]

Unnamed: 0,G7 Population in millions
Canada,40.5
France,64.124
Italy,60.665
United Kingdom,64.511


In [94]:
g7_pop[g7_pop < 70] = 99.99

In [95]:
g7_pop

Unnamed: 0,G7 Population in millions
Canada,99.99
France,99.99
Germany,80.94
Italy,99.99
Japan,127.061
United Kingdom,99.99
United States,500.0
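Assigning through a boolean mask changes the Series in place. When the original data should be preserved, one option is to work on a copy, as in this sketch (the names `pop` and `edited` are illustrative):

```python
import pandas as pd

pop = pd.Series({'Canada': 40.5, 'Germany': 80.94, 'Italy': 60.665})

# Copy first so the bulk assignment does not touch the original
edited = pop.copy()
edited[edited < 70] = 99.99

print(pop.tolist())
print(edited.tolist())
```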


In [96]:
g7_pop[g7_pop == "Canada"]

Unnamed: 0,G7 Population in millions


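Note the data types: the comparison is applied to the Series values (the column data), whereas `"Canada"` is an index label, so no element matches: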

In [97]:
g7_pop == "Canada"

Unnamed: 0,G7 Population in millions
Canada,False
France,False
Germany,False
Italy,False
Japan,False
United Kingdom,False
United States,False
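To filter by label with a boolean mask, the comparison has to run against the index rather than the values. A short sketch on an illustrative two-country Series:

```python
import pandas as pd

pop = pd.Series({'Canada': 35.467, 'France': 63.951})

# Compares the float values with a string: nothing matches
by_value = pop[pop == 'Canada']

# Compares the index labels with the string: one match
by_label = pop[pop.index == 'Canada']

print(len(by_value), by_label.tolist())
```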


In [98]:
g7_pop == 50

Unnamed: 0,G7 Population in millions
Canada,False
France,False
Germany,False
Italy,False
Japan,False
United Kingdom,False
United States,False


In [99]:
g7_pop == 99.990

Unnamed: 0,G7 Population in millions
Canada,True
France,True
Germany,False
Italy,True
Japan,False
United Kingdom,True
United States,False
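The equality test above works because `99.99` was assigned literally. For values that come out of arithmetic, exact `==` comparisons on floats can miss matches due to rounding, and a tolerance-based check such as `numpy.isclose` is usually safer. A small illustration with made-up data:

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is not exactly 0.3 in floating point
s = pd.Series([0.1 + 0.2, 0.3])

exact = s == 0.3                      # exact comparison
close = np.isclose(s.to_numpy(), 0.3) # comparison within a small tolerance

print(exact.tolist())
print(close.tolist())
print(s[close].size)  # the mask can be used for selection as usual
```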


In [102]:
g7_pop[g7_pop == 99.990] = 10
g7_pop

Unnamed: 0,G7 Population in millions
Canada,10.0
France,10.0
Germany,80.94
Italy,10.0
Japan,127.061
United Kingdom,10.0
United States,500.0


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
