## Pandas Tutorial

In [1]:
import numpy as np
import pandas as pd

In [2]:
def createDF(inputString, inputValue):
    # Pair each column name in inputString with its list of values in inputValue
    dfSchema = dict(zip(inputString, inputValue))
    return pd.DataFrame(dfSchema)

In [3]:
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
             'Netherlands', 'Germany', 'Switzerland', 'Belarus',
             'Austria', 'France', 'Poland', 'China', 'Korea', 
             'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
             'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
             'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']
gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

inputString = ["Country", "Gold", "Silver", "Bronze"]
inputValue = [countries, gold, silver, bronze]

df = createDF(inputString, inputValue)

In [4]:
df[['Gold', 'Silver']]  # DataFrame
df['Gold']  # Series

0     13
1     11
2     10
3      9
4      8
5      8
6      6
7      5
8      4
9      4
10     4
11     3
12     3
13     2
14     2
15     2
16     1
17     1
18     1
19     1
20     1
21     0
22     0
23     0
24     0
25     0
Name: Gold, dtype: int64

In [5]:
df[df.Gold > 10]  # Shorthand for df[df['Gold'] > 10]; works because 'Gold' is a valid identifier

   Bronze       Country  Gold  Silver
0       9  Russian Fed.    13      11
1      10        Norway    11       5


In [6]:
df.loc[0]  # Returns a Series

Bronze                9
Country    Russian Fed.
Gold                 13
Silver               11
Name: 0, dtype: object

In [7]:
df.at[5, 'Gold']  # Returns a single scalar value; `at` takes square brackets with a row label and a column label

8

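`iloc` and `iat` are integer-based: they select by position. `loc` and `at` are label-based: they select by index and column labels. `loc` and `iloc` can return rows or slices, while `at` and `iat` access a single scalar.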
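
To make the label/position distinction concrete, here is a minimal sketch. It uses a small hypothetical frame, `demo`, with string row labels chosen purely for illustration (the medal table above has the default integer index, where labels and positions happen to coincide).

```python
import pandas as pd

# Hypothetical three-row frame with string labels, so that labels and
# positions are clearly different things.
demo = pd.DataFrame(
    {"Gold": [13, 11, 10], "Silver": [11, 5, 10]},
    index=["RUS", "NOR", "CAN"],
)

row_by_label = demo.loc["NOR"]           # label-based row -> Series
row_by_position = demo.iloc[1]           # position-based row -> Series
cell_by_label = demo.at["NOR", "Gold"]   # label-based scalar
cell_by_position = demo.iat[1, 0]        # position-based scalar

print(row_by_label.equals(row_by_position))  # True
print(cell_by_label, cell_by_position)       # 11 11
```

With the default `RangeIndex`, `df.loc[0]` and `df.iloc[0]` return the same row, which is why the difference is easy to miss in the medal table.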

### Apply, map

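1. `map(lambda x: f)` can be thought of as `map(f: A => A)` in `scala`: it does not change the schema. **It operates on a `Series` or an `Index`.**
2. `apply(func, ...)` is essentially also `map(f: A => A)` in `scala`. **It operates on a `DataFrame`, passing each column to `func` in turn, and returns a `Series` or a `DataFrame`** depending on whether `func` reduces the column to a single value.
3. `applymap(func)` applies `func` to every element and returns a `DataFrame`.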
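
The three methods can be compared side by side in a minimal sketch. The frame `medals` below is a hypothetical three-row subset used only for illustration. One assumption to flag: in recent pandas releases (2.1 and later) `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, so the sketch checks which one is available.

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
medals = pd.DataFrame({"Gold": [13, 11, 10], "Silver": [11, 5, 10]})

# Series.map: element-wise on one column, returns a Series of the same length.
doubled = medals["Gold"].map(lambda x: x * 2)

# DataFrame.apply: func receives each column as a Series (axis=0 by default).
col_max = medals.apply(lambda col: col.max())               # reduces -> Series
col_centered = medals.apply(lambda col: col - col.mean())   # same shape -> DataFrame

# Element-wise over the whole frame. Prefer DataFrame.map where it exists,
# otherwise fall back to applymap on older pandas versions.
if hasattr(medals, "map"):
    is_big = medals.map(lambda x: x > 10)
else:
    is_big = medals.applymap(lambda x: x > 10)

print(doubled.tolist())   # [26, 22, 20]
print(col_max.to_dict())
print(is_big)
```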

In [8]:
df[['Gold', 'Silver']].apply(np.mean)

Gold      3.807692
Silver    3.730769
dtype: float64

In [9]:
df['Gold'].map(lambda x: x > 10)  # A more verbose version of df.Gold > 10

0      True
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
Name: Gold, dtype: bool

In [10]:
df[df.Gold > 10]

   Bronze       Country  Gold  Silver
0       9  Russian Fed.    13      11
1      10        Norway    11       5


Score each gold medal as 4 points, each silver as 2, and each bronze as 1, then build a DataFrame of country and total points.

In [11]:
scoring = np.array([4, 2, 1])
medals = np.array([gold, silver, bronze])
points = np.dot(scoring, medals)
new_df = createDF(["Country", "Points"], [countries, points])
new_df.head(3)

        Country  Points
0  Russian Fed.      83
1        Norway      64
2        Canada      65


In [12]:
df['Points'] = df[['Gold', 'Silver', 'Bronze']].dot([4, 2, 1])
new_df_2 = df[['Country', 'Points']]
new_df_2.head(3)

        Country  Points
0  Russian Fed.      83
1        Norway      64
2        Canada      65
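
Both cells compute the same weighted sum, 4 × gold + 2 × silver + 1 × bronze, and differ only in which operand comes first in the dot product. The following sketch checks this on the first three rows of the medal table. The names `top3`, `weights`, `points_np`, and `points_pd` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# First three rows of the medal table above.
top3 = pd.DataFrame({
    "Country": ["Russian Fed.", "Norway", "Canada"],
    "Gold": [13, 11, 10],
    "Silver": [11, 5, 10],
    "Bronze": [9, 10, 5],
})
weights = [4, 2, 1]

# NumPy route: (3,) weights dotted with a (3, n_countries) medal matrix.
medal_matrix = np.array([top3["Gold"], top3["Silver"], top3["Bronze"]])
points_np = np.dot(weights, medal_matrix)

# pandas route: (n_countries, 3) frame dotted with (3,) weights.
points_pd = top3[["Gold", "Silver", "Bronze"]].dot(weights)

print(points_np.tolist())  # [83, 64, 65]
print(points_pd.tolist())  # [83, 64, 65]
```

The pandas route keeps the row index attached to the result, which is what lets it be assigned straight back as a new column.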
