In [1]:
import numpy as np
import pandas as pd

The index can be, for example, a sequence of dates (or measurement times, etc.).

In [2]:
d=pd.date_range('20170101',periods=10)
d

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10'],
              dtype='datetime64[ns]', freq='D')

In [3]:
s=pd.Series(np.random.normal(size=10),index=d)
s

2017-01-01    0.356263
2017-01-02   -0.149695
2017-01-03    0.823284
2017-01-04    1.936065
2017-01-05    0.309854
2017-01-06    0.642161
2017-01-07    0.499560
2017-01-08    0.004974
2017-01-09    0.245381
2017-01-10    0.951140
Freq: D, dtype: float64
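With a date index, elements can be addressed by date labels. The following is a minimal sketch on a hypothetical Series `demo` (fixed values rather than the random ones above, so the result is reproducible); the names `idx`, `demo`, `first` and `window` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten daily values 0.0 .. 9.0
idx = pd.date_range('20170101', periods=10)
demo = pd.Series(np.arange(10.0), index=idx)

# A single element is addressed by its timestamp label
first = demo.loc[pd.Timestamp('2017-01-01')]

# A slice by date labels includes both endpoints
window = demo.loc['2017-01-03':'2017-01-05']

print(first)
print(window)
```

Note that a label slice, unlike a positional one, keeps its right endpoint, so `window` holds three elements.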

Comparison operations return a Series of boolean values.

In [4]:
s>0

2017-01-01     True
2017-01-02    False
2017-01-03     True
2017-01-04     True
2017-01-05     True
2017-01-06     True
2017-01-07     True
2017-01-08     True
2017-01-09     True
2017-01-10     True
Freq: D, dtype: bool

If such a boolean Series is used for indexing, the result is the subset containing only those elements for which the condition is `True`.

In [5]:
s[s>0]

2017-01-01    0.356263
2017-01-03    0.823284
2017-01-04    1.936065
2017-01-05    0.309854
2017-01-06    0.642161
2017-01-07    0.499560
2017-01-08    0.004974
2017-01-09    0.245381
2017-01-10    0.951140
dtype: float64
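Boolean masks can also be combined before indexing. A small sketch, assuming a hypothetical Series `t` with hand-picked values; `&` is element-wise AND and `~` is element-wise NOT.

```python
import pandas as pd

# Hypothetical values chosen for illustration
t = pd.Series([0.4, -0.1, 0.8, 1.9, -0.5])

positive = t > 0   # boolean Series with the same index as t
small = t < 1

# Element-wise AND of two masks, then indexing with the result
selected = t[positive & small]

# ~ negates a mask: keep the elements that are not positive
non_positive = t[~positive]

print(selected)
print(non_positive)
```

When the comparisons are written inline, each one needs parentheses, as in `t[(t > 0) & (t < 1)]`, because `&` binds more tightly than `>` and `<`.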

Cumulative maxima: each element is the maximum from the first element up to the current one.

In [6]:
s.cummax()

2017-01-01    0.356263
2017-01-02    0.356263
2017-01-03    0.823284
2017-01-04    1.936065
2017-01-05    1.936065
2017-01-06    1.936065
2017-01-07    1.936065
2017-01-08    1.936065
2017-01-09    1.936065
2017-01-10    1.936065
Freq: D, dtype: float64
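To make the definition concrete, here is a sketch that compares `cummax` with a running maximum computed by NumPy. The Series `v` is hypothetical, and `np.maximum.accumulate` is used only as an independent check.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen for illustration
v = pd.Series([0.3, -0.1, 0.8, 1.9, 0.3])

running_max = v.cummax()

# The same running maximum computed directly on the underlying array
manual = np.maximum.accumulate(v.to_numpy())

print(running_max.to_numpy())
print(manual)
```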

Cumulative sums. Note that `s` is overwritten here with its cumulative sum.

In [7]:
s=s.cumsum()
s

2017-01-01    0.356263
2017-01-02    0.206568
2017-01-03    1.029852
2017-01-04    2.965918
2017-01-05    3.275771
2017-01-06    3.917933
2017-01-07    4.417493
2017-01-08    4.422467
2017-01-09    4.667848
2017-01-10    5.618988
Freq: D, dtype: float64
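A cumulative sum can be undone by taking differences of neighbouring elements. A minimal sketch with hypothetical step values; `diff` leaves `NaN` in the first position, which is restored by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical increments chosen for illustration
steps = pd.Series([1.0, -2.0, 3.0, 0.5])

path = steps.cumsum()      # running total of the increments

# diff() returns path[i] - path[i-1]; the first element has no predecessor
recovered = path.diff()
recovered.iloc[0] = path.iloc[0]

print(path.to_numpy())
print(np.allclose(recovered, steps))
```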

Let's plot it.

In [8]:
import matplotlib
matplotlib.use('pdf')
import matplotlib.pyplot as plt

In [9]:
plt.plot(s)
plt.savefig('b24_pandas_1.pdf')
plt.clf()

Let's create a table (a `DataFrame`) from an array of random numbers.

In [10]:
df=pd.DataFrame(np.random.randn(10,4),
                columns=['A','B','C','D'])
df

,A,B,C,D
0,0.706305,-0.789569,-0.692519,0.340655
1,0.277662,1.168946,-0.456736,-0.824495
2,-1.742185,-1.623243,-0.188642,-0.16413
3,-0.486917,-0.404665,0.828688,1.960935
4,-0.009105,1.221108,0.399887,3.07569
5,0.440579,1.513206,0.251294,0.838322
6,1.80018,1.87725,-0.721819,-1.6379
7,0.518764,1.372829,-0.5136,-0.544454
8,-0.823112,-1.18859,1.861447,1.272845
9,1.10327,-1.380487,0.227549,0.769414


In [11]:
df2=pd.DataFrame(np.random.randn(7,3),columns=['A','B','C'])
df+df2

,A,B,C,D
0,-0.396002,-0.388489,-0.938762,
1,0.025981,-0.822861,-1.221219,
2,-1.859502,-0.265075,-1.39991,
3,-0.937928,-0.118183,-0.413946,
4,-0.038995,1.159641,2.223911,
5,0.478176,-0.599153,-1.214517,
6,1.387845,0.992897,-0.214836,
7,,,,
8,,,,
9,,,,
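
The empty cells in the result above are `NaN`: arithmetic between two DataFrames aligns them on both row and column labels, and a cell that is missing in either operand has no defined sum. The sketch below reproduces this on two small hypothetical frames `a` and `b`, and shows `DataFrame.add` with `fill_value=0` as one possible way to treat cells missing on one side as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: b has fewer rows and no column B
a = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})
b = pd.DataFrame({'A': [100.0, 200.0]})

# Labels are aligned first; cells missing in either operand become NaN
total = a + b

# fill_value=0 substitutes 0 for a cell missing in only one operand
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```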


In [12]:
2*df+3

,A,B,C,D
0,4.41261,1.420863,1.614962,3.681311
1,3.555324,5.337893,2.086527,1.351011
2,-0.48437,-0.246485,2.622715,2.671739
3,2.026166,2.19067,4.657376,6.92187
4,2.981789,5.442216,3.799774,9.151381
5,3.881158,6.026413,3.502587,4.676644
6,6.600359,6.7545,1.556363,-0.2758
7,4.037527,5.745658,1.9728,1.911092
8,1.353775,0.622821,6.722893,5.54569
9,5.206539,0.239026,3.455098,4.538828


In [13]:
np.sin(df)

,A,B,C,D
0,0.649027,-0.71005,-0.638478,0.334105
1,0.274108,0.920339,-0.441021,-0.734205
2,-0.985349,-0.998625,-0.187526,-0.163394
3,-0.467903,-0.393711,0.737045,0.924856
4,-0.009105,0.93948,0.389314,0.065855
5,0.426463,0.998342,0.248657,0.743522
6,0.973807,0.953409,-0.660751,-0.997749
7,0.495807,0.980468,-0.491316,-0.517951
8,-0.733266,-0.927844,0.958058,0.95594
9,0.892686,-0.981946,0.22559,0.695714


In [14]:
df.describe()

,A,B,C,D
count,10.0,10.0,10.0,10.0
mean,0.178544,0.176679,0.099555,0.508688
std,1.007094,1.373741,0.805262,1.391815
min,-1.742185,-1.623243,-0.721819,-1.6379
25%,-0.367464,-1.088834,-0.499384,-0.449373
50%,0.359121,0.382141,0.019453,0.555035
75%,0.65942,1.334899,0.362739,1.164214
max,1.80018,1.87725,1.861447,3.07569


In [15]:
df.sort_values(by='B')

,A,B,C,D
2,-1.742185,-1.623243,-0.188642,-0.16413
9,1.10327,-1.380487,0.227549,0.769414
8,-0.823112,-1.18859,1.861447,1.272845
0,0.706305,-0.789569,-0.692519,0.340655
3,-0.486917,-0.404665,0.828688,1.960935
1,0.277662,1.168946,-0.456736,-0.824495
4,-0.009105,1.221108,0.399887,3.07569
7,0.518764,1.372829,-0.5136,-0.544454
5,0.440579,1.513206,0.251294,0.838322
6,1.80018,1.87725,-0.721819,-1.6379


The `iloc` attribute is similar to `loc`, but works with integer positions: the first index is the row number and the second is the column number. The end of a range is not included (as usual in Python).

In [16]:
df.iloc[2]

A   -1.742185
B   -1.623243
C   -0.188642
D   -0.164130
Name: 2, dtype: float64

In [17]:
df.iloc[1:3]

,A,B,C,D
1,0.277662,1.168946,-0.456736,-0.824495
2,-1.742185,-1.623243,-0.188642,-0.16413


In [18]:
df.iloc[1:3,0:2]

,A,B
1,0.277662,1.168946
2,-1.742185,-1.623243
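
The difference between positional and label-based slicing is easier to see when the index labels do not coincide with the positions. A minimal sketch on a hypothetical frame `t` whose index is 10, 20, 30, 40:

```python
import pandas as pd

# Hypothetical frame with a non-default index, so labels differ from positions
t = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]},
                 index=[10, 20, 30, 40])

# Positions 1 and 2; the end of the range (3) is excluded
by_position = t.iloc[1:3]

# Labels 20 through 40; the end of the range is included
by_label = t.loc[20:40]

# Rows at positions 1 and 2, column at position 0 only
corner = t.iloc[1:3, 0:1]

print(by_position)
print(by_label)
print(corner)
```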


Let's plot the cumulative sums: the world lines of four drunkards, each of whose steps is a Gaussian random variable.

In [19]:
cs=df.cumsum()
cs

,A,B,C,D
0,0.706305,-0.789569,-0.692519,0.340655
1,0.983967,0.379378,-1.149255,-0.483839
2,-0.758218,-1.243865,-1.337898,-0.647969
3,-1.245135,-1.64853,-0.50921,1.312965
4,-1.25424,-0.427422,-0.109322,4.388656
5,-0.813661,1.085784,0.141971,5.226978
6,0.986519,2.963035,-0.579848,3.589078
7,1.505282,4.335864,-1.093447,3.044624
8,0.68217,3.147274,0.767999,4.317469
9,1.78544,1.766788,0.995548,5.086883


In [20]:
plt.plot(cs)
plt.savefig('b24_pandas_2.pdf')
plt.clf()
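
The same random walk can be made reproducible and longer. This is a sketch under stated assumptions: the seed `0` and the length of 1000 steps are arbitrary choices, and `np.random.default_rng` is used in place of the global `np.random` functions from the cells above.

```python
import numpy as np
import pandas as pd

# Seeded generator so that the walk is the same on every run
rng = np.random.default_rng(0)

# 1000 Gaussian steps for each of the four walkers
steps = pd.DataFrame(rng.normal(size=(1000, 4)),
                     columns=['A', 'B', 'C', 'D'])

# Position after each step is the cumulative sum of the steps so far
walks = steps.cumsum()

# Final positions of the four walkers
print(walks.iloc[-1])
```

For independent steps with unit variance, the standard deviation of the position after n steps is the square root of n, so the walkers drift apart slowly compared with the number of steps taken.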