### <font color='blue'>Reading a CSV file</font>

In [None]:
import pandas as pd

# Load the weather data into a DataFrame
df = pd.read_csv("data/weather_data.csv")
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [7]:
df.shape

(6, 4)
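
To try the same call without the file on disk, here is a minimal sketch that reads CSV text from memory. The name `csv_text` and its three rows are assumptions made for illustration, copied from the column layout shown in the output above; `df_demo` is likewise a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV mirroring the columns shown above
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
"""

# read_csv accepts any file-like object, not only a path
df_demo = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_cols = df_demo.shape
```
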

## <font color="red"><h4 align="center">2. Data Operations</h4></font>

This section focuses on Series and DataFrame, which are used far more widely than Panel.

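The Python and NumPy indexing operator [] is convenient, but because the type of the data being accessed is not known in advance, these standard operators have optimization limits. Prefer the pandas-specific access methods where possible.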

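Pandas currently supports two multi-axis indexing methods (a third, **.ix**, was deprecated and has been removed in pandas 1.0):
- **.loc**, label-based
- **.iloc**, position-based, from 0 to length-1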
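
A minimal sketch of the difference between the two. It uses a hypothetical Series `s_demo` whose integer labels deliberately differ from the positions, so the two accessors cannot be confused.

```python
import pandas as pd

# Hypothetical Series: labels are 10, 20, 30; positions are 0, 1, 2
s_demo = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = s_demo.loc[10]       # label 10 -> 'a'
by_position = s_demo.iloc[2]    # position 2 (last element) -> 'c'

# .loc slices include both endpoints; .iloc slices exclude the stop position
label_slice = s_demo.loc[10:20]  # labels 10 and 20
pos_slice = s_demo.iloc[0:1]     # position 0 only
```
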


### <font color="blue">2.1 Selecting data with [ ]</font>

<table border="1">
<tr>
<th>Object type</th>
<th>Selection syntax</th>
<th>Return type</th>
</tr>
<tr>
<td>Series</td>
<td>series[label]</td>
<td>scalar value</td>
</tr>
<tr>
<td>DataFrame</td>
<td>frame[colname]</td>
<td>Series corresponding to colname</td>
</tr>
<tr>
<td>Panel</td>
<td>panel[itemname]</td>
<td>DataFrame corresponding to itemname</td>
</tr>
</table>

 

First, define the data we need.

Because pandas has deprecated Panel, it is not demonstrated below.

In [2]:
import numpy as np

dates = pd.date_range('7/6/2017', periods=8)
# index sets the row labels; columns sets the column names (the header)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2017-07-06,-0.850651,-0.477734,-1.588885,0.245086
2017-07-07,0.648335,-1.008562,0.98698,1.136734
2017-07-08,0.570197,0.545597,-0.81503,1.584542
2017-07-09,-1.673372,-0.673596,-0.711874,-0.976931
2017-07-10,-1.4012,0.677008,-2.046262,-1.787886
2017-07-11,-0.704555,1.036348,-1.477565,0.945378
2017-07-12,0.173839,-0.232074,-0.030759,0.707334
2017-07-13,0.41935,1.0622,-0.554513,-0.21773


In [27]:
# Get column 'A'
s = df['A']
s

2017-07-06   -0.850651
2017-07-07    0.648335
2017-07-08    0.570197
2017-07-09   -1.673372
2017-07-10   -1.401200
2017-07-11   -0.704555
2017-07-12    0.173839
2017-07-13    0.419350
Freq: D, Name: A, dtype: float64

In [28]:
# Same result as df['A']
df.A

2017-07-06   -0.850651
2017-07-07    0.648335
2017-07-08    0.570197
2017-07-09   -1.673372
2017-07-10   -1.401200
2017-07-11   -0.704555
2017-07-12    0.173839
2017-07-13    0.419350
Freq: D, Name: A, dtype: float64

In [6]:
# Get the 6th value (looked up by its date label)
s[dates[5]]

-0.70455509146770023

In [11]:
# Swap columns A and B
df[['B', 'A']] = df[['A', 'B']]

In [10]:
df

Unnamed: 0,A,B,C,D
2017-07-06,-0.477734,-0.850651,-1.588885,0.245086
2017-07-07,-1.008562,0.648335,0.98698,1.136734
2017-07-08,0.545597,0.570197,-0.81503,1.584542
2017-07-09,-0.673596,-1.673372,-0.711874,-0.976931
2017-07-10,0.677008,-1.4012,-2.046262,-1.787886
2017-07-11,1.036348,-0.704555,-1.477565,0.945378
2017-07-12,-0.232074,0.173839,-0.030759,0.707334
2017-07-13,1.0622,0.41935,-0.554513,-0.21773


In [12]:
# Select columns A and B
df[['A', 'B']]

Unnamed: 0,A,B
2017-07-06,-0.850651,-0.477734
2017-07-07,0.648335,-1.008562
2017-07-08,0.570197,0.545597
2017-07-09,-1.673372,-0.673596
2017-07-10,-1.4012,0.677008
2017-07-11,-0.704555,1.036348
2017-07-12,0.173839,-0.232074
2017-07-13,0.41935,1.0622


Slicing by position is best done with **.iloc**, though [] also works.
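
The `.iloc` form of the slices used in this section can be sketched as follows. `s_demo` is a hypothetical stand-in for `s`, filled with 0.0 to 7.0 instead of random values so the results are predictable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for s: eight daily values 0.0 .. 7.0
dates_demo = pd.date_range('7/6/2017', periods=8)
s_demo = pd.Series(np.arange(8.0), index=dates_demo, name='A')

first_five = s_demo.iloc[:5]     # positions 0..4
every_other = s_demo.iloc[::2]   # positions 0, 2, 4, 6
reversed_all = s_demo.iloc[::-1]  # all elements, last to first
```
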

In [14]:
# Get the first 5 items of s
s[:5]

2017-07-06   -0.850651
2017-07-07    0.648335
2017-07-08    0.570197
2017-07-09   -1.673372
2017-07-10   -1.401200
Freq: D, Name: A, dtype: float64

In [15]:
# Take every second item, i.e. skip one between picks
s[::2]

2017-07-06   -0.850651
2017-07-08    0.570197
2017-07-10   -1.401200
2017-07-12    0.173839
Freq: 2D, Name: A, dtype: float64

In [16]:
# Select everything in reverse order
s[::-1]

2017-07-13    0.419350
2017-07-12    0.173839
2017-07-11   -0.704555
2017-07-10   -1.401200
2017-07-09   -1.673372
2017-07-08    0.570197
2017-07-07    0.648335
2017-07-06   -0.850651
Freq: -1D, Name: A, dtype: float64

In [17]:
# Copy s, then set the first 5 items of s2 to 0
s2 = s.copy()
s2[:5] = 0

In [18]:
s2

2017-07-06    0.000000
2017-07-07    0.000000
2017-07-08    0.000000
2017-07-09    0.000000
2017-07-10    0.000000
2017-07-11   -0.704555
2017-07-12    0.173839
2017-07-13    0.419350
Freq: D, Name: A, dtype: float64

In [13]:
# Select the two rows at positions 1 and 2
df[1:3]

Unnamed: 0,A,B,C,D
2017-07-07,0.648335,-1.008562,0.98698,1.136734
2017-07-08,0.570197,0.545597,-0.81503,1.584542


In [19]:
# Select the first 3 rows of the DataFrame
df[:3]

Unnamed: 0,A,B,C,D
2017-07-06,-0.850651,-0.477734,-1.588885,0.245086
2017-07-07,0.648335,-1.008562,0.98698,1.136734
2017-07-08,0.570197,0.545597,-0.81503,1.584542


In [20]:
# Select the whole DataFrame in reverse row order
df[::-1]

Unnamed: 0,A,B,C,D
2017-07-13,0.41935,1.0622,-0.554513,-0.21773
2017-07-12,0.173839,-0.232074,-0.030759,0.707334
2017-07-11,-0.704555,1.036348,-1.477565,0.945378
2017-07-10,-1.4012,0.677008,-2.046262,-1.787886
2017-07-09,-1.673372,-0.673596,-0.711874,-0.976931
2017-07-08,0.570197,0.545597,-0.81503,1.584542
2017-07-07,0.648335,-1.008562,0.98698,1.136734
2017-07-06,-0.850651,-0.477734,-1.588885,0.245086


### <font color="blue">2.2 Selecting data with .loc (label)</font>

In [26]:
df1 = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), index=pd.date_range('20130101', periods=5))
df1

Unnamed: 0,A,B,C,D
2013-01-01,2.031166,0.857202,1.336376,-1.050434
2013-01-02,-1.110988,-0.286514,0.345784,-0.428011
2013-01-03,0.313629,-0.092454,-0.143808,1.355029
2013-01-04,1.281271,1.605933,-0.593882,0.072949
2013-01-05,-0.039043,1.1185,0.07308,-0.005441


In [29]:
# Range selection (a label slice; both endpoints are included)
df1['20130101':'20130104']

Unnamed: 0,A,B,C,D
2013-01-01,2.031166,0.857202,1.336376,-1.050434
2013-01-02,-1.110988,-0.286514,0.345784,-0.428011
2013-01-03,0.313629,-0.092454,-0.143808,1.355029
2013-01-04,1.281271,1.605933,-0.593882,0.072949


In [31]:
# Don't forget .loc; without it, df1['20130101'] raises an error
df1.loc['20130101']

A    2.031166
B    0.857202
C    1.336376
D   -1.050434
Name: 2013-01-01 00:00:00, dtype: float64
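
`.loc` also takes a column selector after the row selector. Here is a minimal sketch, assuming a hypothetical frame `df_demo` with the same index and columns as `df1` but filled with 0..19 so the results are easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like df1, with predictable values 0..19
df_demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=list('ABCD'),
    index=pd.date_range('20130101', periods=5),
)

# Label slices with .loc include both endpoints: 3 rows here
rows = df_demo.loc['20130102':'20130104']

# Rows and columns can be selected together by label
sub = df_demo.loc['20130102':'20130103', ['A', 'C']]

# One row label plus one column label returns a scalar
value = df_demo.loc[pd.Timestamp('2013-01-01'), 'B']
```
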

### <font color="blue">2.3 Selecting data with .iloc</font>
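
A minimal sketch of position-based selection with `.iloc`. The frame `df_demo` is a hypothetical example filled with 0..19; positions start at 0 and slice stops are excluded, as in plain Python.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x4 frame with predictable values 0..19
df_demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=list('ABCD'),
    index=pd.date_range('20130101', periods=5),
)

row = df_demo.iloc[3]                   # 4th row, returned as a Series
block = df_demo.iloc[1:3, 0:2]          # rows 1-2, columns 0-1
picked = df_demo.iloc[[0, 2], [1, 3]]   # explicit row and column positions
cell = df_demo.iloc[1, 1]               # a single scalar value
```
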

## <font color="red"><h4 align="center">3. Attributes</h4></font>

## <font color="red"><h4 align="center">4. Methods</h4></font>

In [3]:
maxvalue = df['Temperature'].max()
assert maxvalue == 50

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8]) # np.nan is NaN (not a number), the missing-value marker
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a `DataFrame` by passing a NumPy array.

In [4]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,1.081257,-0.583521,-0.956591,-1.15023
2013-01-02,0.598011,0.044001,0.805043,-1.708032
2013-01-03,0.497438,-0.879815,-1.790328,0.317809
2013-01-04,1.080999,-0.991755,1.326935,-1.352318
2013-01-05,0.016344,-0.033948,0.766915,1.387153
2013-01-06,-0.036608,0.181089,0.460195,0.097629


Create a `DataFrame` by passing a dict of objects that can be converted to Series-like structures.

In [6]:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [3]:
# A structured array
my_array = np.ones(3, dtype=([('foo', int), ('bar', float)]))
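
A structured array like this can be passed straight to the `DataFrame` constructor, and its field names become the column names. A minimal sketch, with `df_demo` as a hypothetical name:

```python
import numpy as np
import pandas as pd

# Structured array with two named fields
my_array = np.ones(3, dtype=[('foo', int), ('bar', float)])

# Each field becomes one column of the resulting frame
df_demo = pd.DataFrame(my_array)
```
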

### <font color="blue">2.1 Row operations</font>

### DataFrame.head(n=5)
Returns the first n rows; n defaults to 5.

In [8]:
df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
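
Passing `n` explicitly changes how many rows come back. A minimal sketch on a hypothetical six-row frame `df_demo`:

```python
import pandas as pd

# Hypothetical frame with rows 0..5
df_demo = pd.DataFrame({'n': range(6)})

first_two = df_demo.head(2)          # rows 0 and 1
all_but_last_two = df_demo.head(-2)  # negative n: all rows except the last 2
```
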


### DataFrame.tail(n=5)

Returns the last n rows; n defaults to 5.

In [9]:
df.tail()

Unnamed: 0,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
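
As the output above suggests, `tail` keeps the original index labels instead of renumbering from 0. A minimal sketch on a hypothetical six-row frame `df_demo`:

```python
import pandas as pd

# Hypothetical frame with rows 0..5
df_demo = pd.DataFrame({'n': range(6)})

last_two = df_demo.tail(2)

# The index labels of the selected rows are preserved
labels = list(last_two.index)
```
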
