# Pandas

Pandas is a data analysis library built on top of NumPy. It is the main tool in the Python ecosystem for analysing, cleaning and preparing data. It is fast and productive to work with, and it can read data from many kinds of sources.

Pandas is installed with *conda install pandas* or *pip install pandas*.

In [1]:
import numpy as np
import pandas as pd

# Series


A *Series* is the basic one-dimensional pandas data structure, built on top of a NumPy array.

In [2]:
labels = ['x', 'y', 'z']
data = [20, 30, 40]
pd.Series(data=data)

0    20
1    30
2    40
dtype: int64

As the output shows, a pandas Series differs from a plain array in that it carries an index. One of the parameters we can pass when creating a Series is *index*.

In [3]:
pd.Series(data=data, index=labels)

x    20
y    30
z    40
dtype: int64

A pandas Series can also be created from a Python dictionary:

In [10]:
zodynas = {'x':20, 'y':30, 'z':40}
pd.Series(zodynas)

x    20
y    30
z    40
dtype: int64

**Getting a value out of a Series**

In [7]:
serija = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])
# note that the arguments can be passed positionally, without naming the parameters.

In [8]:
serija


Vilnius      1
Kaunas       2
Klaipėda     3
Panevėžys    4
Šiauliai     5
dtype: int64

In [9]:
serija['Vilnius']

1
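
Label-based and position-based access can also be written explicitly with *.loc* and *.iloc*. The following is a minimal sketch that rebuilds the city Series from above; the names `by_label`, `by_position` and `several` are illustrative only and do not appear in the original notebook.

```python
import pandas as pd

serija = pd.Series([1, 2, 3, 4, 5],
                   ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])

by_label = serija.loc['Vilnius']          # look up by index label
by_position = serija.iloc[0]              # look up by integer position
several = serija[['Kaunas', 'Šiauliai']]  # a list of labels returns a Series
```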

**Operations on Series**

In [14]:
serija2 = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

In [15]:
serija2

Vilnius      1
Kaunas       2
Lentvaris    3
Šiauliai     4
Klaipėda     5
dtype: int64

When two Series are added, pandas aligns them by index label and sums the values wherever a label is present in both:

In [16]:
serija + serija2

Kaunas       4.0
Klaipėda     8.0
Lentvaris    NaN
Panevėžys    NaN
Vilnius      2.0
Šiauliai     9.0
dtype: float64

Where a label exists in only one of the two Series, pandas has nothing to add and produces NaN (*not a number*). Because NaN is a floating-point value, the result is converted from int64 to float64.
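
If a label that is missing from one Series should count as 0 rather than yield NaN, one option is *Series.add* with the *fill_value* parameter. This is a sketch that goes beyond the original notebook; the variable name `suma` is illustrative.

```python
import pandas as pd

serija = pd.Series([1, 2, 3, 4, 5],
                   ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])
serija2 = pd.Series([1, 2, 3, 4, 5],
                    ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

# A label found in only one Series is treated as 0 in the other
suma = serija.add(serija2, fill_value=0)
```

With this call, Panevėžys and Lentvaris keep the single value they had instead of becoming NaN.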

# DataFrames

The DataFrame is the central object for pandas operations. To create a new DataFrame, we pass *data*, *index* and *columns*:

In [25]:
df = pd.DataFrame(
    np.random.rand(5,6), 
    ['a', 'b', 'c', 'd', 'e'], 
    ['U', 'V', 'W', 'X', 'Y', 'Z'],
)

In [28]:
df

,U,V,W,X,Y,Z
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787
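
The same kind of call can be written with keyword arguments, which makes the role of each parameter explicit. The sketch below assumes a seeded generator (`np.random.default_rng(42)`, not used in the original) so that the output is reproducible; its numbers therefore differ from the table above, and `df_kw` is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, only for reproducibility

df_kw = pd.DataFrame(
    data=rng.random((5, 6)),                 # 5 rows x 6 columns of floats in [0, 1)
    index=['a', 'b', 'c', 'd', 'e'],         # row labels
    columns=['U', 'V', 'W', 'X', 'Y', 'Z'],  # column labels
)
```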


Each column is a pandas Series, and all columns share the same index (a, b, c, d, e), for example:

In [29]:
df['U']

a    0.342764
b    0.831219
c    0.214500
d    0.294652
e    0.229890
Name: U, dtype: float64

In [30]:
type(df['U'])

pandas.core.series.Series

**If we want several columns:**

In [31]:
df[['U', 'Y', 'Z']]

,U,Y,Z
a,0.342764,0.344495,0.353028
b,0.831219,0.007365,0.753437
c,0.2145,0.970935,0.239019
d,0.294652,0.454541,0.851268
e,0.22989,0.274268,0.166787


**Creating a new column**

In [32]:
df['naujas'] = [1, 2, 3, 4, 5]

In [33]:
df

,U,V,W,X,Y,Z,naujas
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028,1
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437,2
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019,3
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268,4
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787,5
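
A new column does not have to come from a list; it can also be computed from existing columns. A small sketch with a made-up stand-in frame (`df_demo` and the column name `U_plus_V` are illustrative, not from the original):

```python
import pandas as pd

# Small stand-in frame; the values are made up for illustration
df_demo = pd.DataFrame({'U': [0.1, 0.2], 'V': [0.3, 0.4]}, index=['a', 'b'])

# Arithmetic between columns is element-wise and aligned on the index
df_demo['U_plus_V'] = df_demo['U'] + df_demo['V']
```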


**Deleting a column**

In [34]:
df.drop('naujas', axis=1)

,U,V,W,X,Y,Z
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


axis=0 would mean that the operation applies to a row, whereas axis=1 means a column.
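
The meaning of *axis* can be checked on a small example. The sketch below also uses the *index=* and *columns=* keywords of *drop*, which express the same thing without an axis number; `df_demo` and the result names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2, 3], 'V': [4, 5, 6]}, index=['a', 'b', 'c'])

without_col = df_demo.drop('V', axis=1)       # axis=1: drop a column
same_without_col = df_demo.drop(columns='V')  # equivalent keyword form

without_row = df_demo.drop('c', axis=0)       # axis=0: drop a row
same_without_row = df_demo.drop(index='c')    # equivalent keyword form
```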

**The inplace parameter**

Our last operation did not change the original. If we display df now, we see that it is exactly as it was:

In [36]:
df

,U,V,W,X,Y,Z,naujas
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028,1
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437,2
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019,3
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268,4
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787,5


To change the original, we have to pass inplace=True:

In [39]:
df.drop('naujas', axis=1, inplace=True)

In [42]:
df

,U,V,W,X,Y,Z
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


Because *inplace* defaults to False, pandas protects us from accidentally damaging the original data.
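
An alternative to *inplace=True* is to assign the returned frame back to the same name. A sketch with a made-up stand-in frame (`df_demo` and `still_has_column` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2], 'naujas': [10, 20]}, index=['a', 'b'])

df_demo.drop('naujas', axis=1)            # returns a new frame; df_demo is untouched
still_has_column = 'naujas' in df_demo.columns

df_demo = df_demo.drop('naujas', axis=1)  # rebind the name to the result instead
```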

**Let's try deleting a row:**

In [43]:
df.drop('e')

,U,V,W,X,Y,Z
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268


When deleting a row there is no need to pass axis=0, because it is the *default* value.

**Selecting rows**

In [44]:
df.loc['e']

U    0.229890
V    0.994205
W    0.612502
X    0.138417
Y    0.274268
Z    0.166787
Name: e, dtype: float64

Rows can also be selected by integer position:

In [45]:
df.iloc[4]

U    0.229890
V    0.994205
W    0.612502
X    0.138417
Y    0.274268
Z    0.166787
Name: e, dtype: float64
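
Both accessors also accept slices, with one difference worth knowing: a label slice with *.loc* includes its end label, while a position slice with *.iloc* excludes its end position. A sketch on a made-up stand-in frame (`df_demo`, `by_labels` and `by_positions` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2, 3, 4, 5]}, index=['a', 'b', 'c', 'd', 'e'])

by_labels = df_demo.loc['b':'d']   # label slice: includes both 'b' and 'd'
by_positions = df_demo.iloc[1:3]   # position slice: includes 1 and 2, excludes 3
```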

**Subsets**

If we want a single value from the table:

In [46]:
df.loc['c', 'Z']

0.23901877036400765

If we want a block made of selected rows and columns (a *subset*):

In [47]:
df.loc[['a', 'c'], ['U', 'V', 'Z']]

,U,V,Z
a,0.342764,0.72389,0.353028
c,0.2145,0.757023,0.239019


**Selecting data by condition:**

Selecting data by condition works much like it does in NumPy:

In [48]:
df

,U,V,W,X,Y,Z
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


In [50]:
df[df>0.4] 

,U,V,W,X,Y,Z
a,,0.72389,,0.444053,,
b,0.831219,0.417063,0.759778,,,0.753437
c,,0.757023,0.698177,,0.970935,
d,,0.591534,,0.690523,0.454541,0.851268
e,,0.994205,0.612502,,,


Where a value satisfies the condition it is kept; where it does not, we get NaN.

If we need the subset of rows where the value in column 'W' is greater than 0.5:

In [51]:
df[df['W']>0.5]

,U,V,W,X,Y,Z
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


The difference between the two operations is this: when the condition is applied to the whole DataFrame, we get a DataFrame of the same shape with NaN wherever the original value fails the condition. When the condition is applied to a single column, we get only the rows that satisfy it, i.e. we are filtering.
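
The difference comes from the shape of the boolean mask. A sketch on a made-up stand-in frame (all names here are illustrative): a comparison on the whole frame yields a DataFrame of booleans, while a comparison on one column yields a Series with one boolean per row.

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [0.1, 0.9], 'W': [0.7, 0.2]}, index=['a', 'b'])

frame_mask = df_demo > 0.5        # DataFrame of booleans, same shape as df_demo
column_mask = df_demo['W'] > 0.5  # Series of booleans, one per row

masked = df_demo[frame_mask]      # same shape, NaN where the mask is False
filtered = df_demo[column_mask]   # only the rows where the mask is True
```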

**Chaining queries**

In [52]:
df[df['W']>0.5][['U', 'W', 'Z']]

,U,W,Z
b,0.831219,0.759778,0.753437
c,0.2145,0.698177,0.239019
e,0.22989,0.612502,0.166787


In this example we get the same result as running two separate statements one after the other: *df1 = df[df['W']>0.5]* and then *df1[['U', 'W', 'Z']]*. Chaining the queries spares us from creating an extra named variable (here *df1*).
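
The row filter and the column selection can also be passed to *.loc* together. The sketch below, on a made-up stand-in frame with illustrative names, suggests that both forms give the same result for reading data:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'U': [0.3, 0.8, 0.2], 'W': [0.1, 0.7, 0.6], 'Z': [0.3, 0.7, 0.2]},
    index=['a', 'b', 'c'],
)

chained = df_demo[df_demo['W'] > 0.5][['U', 'Z']]          # two selections in a row
single_step = df_demo.loc[df_demo['W'] > 0.5, ['U', 'Z']]  # rows and columns at once
```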

**Combining conditions**

In [53]:
df

,U,V,W,X,Y,Z
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


In [56]:
df[(df['W']>0.5) & (df['U']<0.5)]

,U,V,W,X,Y,Z
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


We got the rows in which the value in column W is greater than 0.5 and the value in column U is less than 0.5.

In [57]:
df[(df['W']>0.5) & (df['U']<0.5)][['W', 'U']]

,W,U
c,0.698177,0.2145
e,0.612502,0.22989


Here we combined two conditions and asked for only two columns from the result.
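
Conditions are combined with the element-wise operators *&* (and), *|* (or) and *~* (not); the Python keywords *and* / *or* raise a ValueError on a Series. Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A sketch on a made-up stand-in frame with illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [0.3, 0.8, 0.2], 'W': [0.1, 0.7, 0.6]},
                       index=['a', 'b', 'c'])

both = df_demo[(df_demo['W'] > 0.5) & (df_demo['U'] < 0.5)]    # AND
either = df_demo[(df_demo['W'] > 0.5) | (df_demo['U'] < 0.5)]  # OR
negated = df_demo[~(df_demo['W'] > 0.5)]                       # NOT
```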

**Operations on the index column**

reset_index turns the old index into an ordinary column and creates a new integer index. To change the original, we have to pass *inplace=True*.

In [58]:
df.reset_index()

,index,U,V,W,X,Y,Z
0,a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
1,b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
2,c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
3,d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
4,e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787


One way to set a new index is to add the labels as a new column first:

In [59]:
naujas_indeksas = 'Vilnius Kaunas Klaipėda Šiauliai Panevėžys'.split()

In [60]:
naujas_indeksas

['Vilnius', 'Kaunas', 'Klaipėda', 'Šiauliai', 'Panevėžys']

In [61]:
df['Miestai'] = naujas_indeksas

In [63]:
df

,U,V,W,X,Y,Z,Miestai
a,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028,Vilnius
b,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437,Kaunas
c,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019,Klaipėda
d,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268,Šiauliai
e,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787,Panevėžys


In [65]:
df.set_index('Miestai')

Miestai,U,V,W,X,Y,Z
Vilnius,0.342764,0.72389,0.18816,0.444053,0.344495,0.353028
Kaunas,0.831219,0.417063,0.759778,0.063848,0.007365,0.753437
Klaipėda,0.2145,0.757023,0.698177,0.323608,0.970935,0.239019
Šiauliai,0.294652,0.591534,0.318895,0.690523,0.454541,0.851268
Panevėžys,0.22989,0.994205,0.612502,0.138417,0.274268,0.166787
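
Like *drop* and *reset_index*, *set_index* returns a new frame and leaves the original unchanged unless the result is assigned or *inplace=True* is passed. A sketch on a made-up stand-in frame (`df_demo`, `by_city` and `value` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'U': [1, 2], 'Miestai': ['Vilnius', 'Kaunas']},
                       index=['a', 'b'])

by_city = df_demo.set_index('Miestai')  # new frame; df_demo keeps its a, b index
value = by_city.loc['Kaunas', 'U']      # rows can now be looked up by city
```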
