# Pandas - data types and column manipulation

In the previous lesson we introduced the pandas library and its basic classes: `Series`, `DataFrame` and `Index`. However, we treated them as static objects that we only looked at.

In this lesson we will start modifying existing tables. We will show:

* how to add or remove columns and rows
* how to change the value of a particular cell
* which data types suit which purpose
* arithmetic and other operations that can be performed on columns
* filtering and sorting rows

And since you surely don't want to lose the results of your work, saving them to external files will come in handy at the end.

In [1]:
import pandas as pd

## Manipulating DataFrames

To warm up, we will work with a small table containing some basic facts about the planets, which you can easily find e.g. on [Wikipedia](https://en.wikipedia.org/wiki/Planet).

In [2]:
# jmeno = name, obezna_poloosa = semi-major axis (AU), obezna_doba = orbital period (years)
planety = pd.DataFrame({
    "jmeno": ["Merkur", "Venuše", "Země", "Mars", "Jupiter", "Saturn", "Uran", "Neptun"],
    "symbol": ["☿", "♀", "⊕", "♂", "♃", "♄", "♅", "♆"],
    "obezna_poloosa": [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.22, 30.06],
    "obezna_doba": [0.24, 0.62, 1, 1.88, 11.86, 29.46, 84.01, 164.8]
})
planety = planety.set_index("jmeno")    # A name-based index is easier to work with
planety

jmeno,symbol,obezna_poloosa,obezna_doba
Merkur,☿,0.39,0.24
Venuše,♀,0.72,0.62
Země,⊕,1.0,1.0
Mars,♂,1.52,1.88
Jupiter,♃,5.2,11.86
Saturn,♄,9.54,29.46
Uran,♅,19.22,84.01
Neptun,♆,30.06,164.8


### Adding a new column

To add a new column (`Series`), we assign it to the `DataFrame` the same way we would assign a value to a dictionary, that is, using square brackets with the column name. The good news is that, just like in the constructor, `pandas` can handle both a `Series` and a plain list (a list is matched to the rows by position, a `Series` by its index labels).

In our particular case, we will look up and add the number of known moons (large and small).

In [3]:
mesice = [0, 0, 1, 2, 79, 82, 27, 14]      # Alternatively a pd.Series, but give it index=planety.index (a Series is aligned by labels)
planety["mesice"] = mesice
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice
Merkur,☿,0.39,0.24,0
Venuše,♀,0.72,0.62,0
Země,⊕,1.0,1.0,1
Mars,♂,1.52,1.88,2
Jupiter,♃,5.2,11.86,79
Saturn,♄,9.54,29.46,82
Uran,♅,19.22,84.01,27
Neptun,♆,30.06,164.8,14
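One subtlety when the new column is a `Series` rather than a list: pandas matches a `Series` to the rows by index labels, not by position. The following is a minimal sketch on a made-up two-row table (`df` and its column names are illustrative, not part of the lesson data) showing what happens when the labels do not match.

```python
import pandas as pd

# Hypothetical miniature table, used only for this illustration
df = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])

# A plain list is matched to the rows by position
df["from_list"] = [10, 20]

# A Series with the default index (labels 0 and 1) matches none of the rows "a", "b",
# so every cell of the new column ends up missing
df["unaligned"] = pd.Series([10, 20])

# A Series that carries the same labels as the table lines up as expected
df["aligned"] = pd.Series([10, 20], index=df.index)

print(df)
```

If you build the column as a `Series`, passing `index=planety.index` (or any index with the same labels) avoids the silent missing values.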


💡 In this case we modified the existing `DataFrame` directly. Most methods and operations (you already know e.g. `set_index`) return a new object by default, which is a good habit that we will stick to. Column assignment is one of the exceptions to this otherwise respected rule, made for the sake of convenience.

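To see the difference between an operation that returns a new object and one that modifies the table in place, here is a small sketch on a made-up table (the names `df` and `indexed` are only illustrative).

```python
import pandas as pd

# Hypothetical miniature table, used only for this illustration
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# set_index returns a new DataFrame; `df` itself keeps its original columns
indexed = df.set_index("name")
print(list(df.columns))        # "name" is still an ordinary column of df
print(list(indexed.columns))   # in the new object it has become the index

# Column assignment, on the other hand, changes `df` itself
df["double"] = df["value"] * 2
print(list(df.columns))
```

This is why the lesson writes `planety = planety.set_index("jmeno")` but a bare `planety["mesice"] = mesice`.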
   
`DataFrame` also offers the `assign` method, which does not modify the table but creates a copy of it with the added (or replaced) columns:

In [4]:
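# A new temporary DataFrame
# (je_stavebnice: Merkur is also a Czech construction set;
#  ma_vztah_k_vestonicim: think of the Venus of Dolní Věstonice)
planety.assign(
    je_stavebnice=[True, False, False, False, False, False, False, False],
    ma_vztah_k_vestonicim=[False, True, False, False, False, False, False, False],
)

# The `planety` object remains unchanged.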

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_stavebnice,ma_vztah_k_vestonicim
Merkur,☿,0.39,0.24,0,True,False
Venuše,♀,0.72,0.62,0,False,True
Země,⊕,1.0,1.0,1,False,False
Mars,♂,1.52,1.88,2,False,False
Jupiter,♃,5.2,11.86,79,False,False
Saturn,♄,9.54,29.46,82,False,False
Uran,♅,19.22,84.01,27,False,False
Neptun,♆,30.06,164.8,14,False,False


**Task**: Try (one way or the other) to add a column with the year of discovery (`"objeveno"`). You can find the data e.g. here: https://cs.wikipedia.org/wiki/Slune%C4%8Dn%C3%AD_soustava.

It is not often practical, but a single scalar value can also be used for all the values of a new column:

In [5]:
planety["je_planeta"] = True     # This held until 2006

### Adding a new row

If we take a time machine back to the childhood (or early adulthood) of the authors of these materials, that is, before 2006, when the astronomical congress that defined the term "planet" took place in Prague (but not before 1930!), we gain a new planet: Pluto.

We will insert it into our table using the `loc` indexer, which we have already used for "peeking" into the table:

In [6]:
planety.loc["Pluto"] = ["♇", 39.48, 247.94, 5, True]   # List of the values in the row
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta
Merkur,☿,0.39,0.24,0,True
Venuše,♀,0.72,0.62,0,True
Země,⊕,1.0,1.0,1,True
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True
Pluto,♇,39.48,247.94,5,True


### Changing the value of a cell

The "indexers" `.loc` and `.iloc` with two arguments in square brackets refer directly to a particular cell, and assigning to them (again, much like with a dictionary) writes the value to that place. You just need to keep the order (row, column).

We return to the present once more and strip Pluto of its privileges:

In [7]:
planety.loc["Pluto", "je_planeta"] = False
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta
Merkur,☿,0.39,0.24,0,True
Venuše,♀,0.72,0.62,0,True
Země,⊕,1.0,1.0,1,True
Mars,♂,1.52,1.88,2,True
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True
Pluto,♇,39.48,247.94,5,False
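The cell above was addressed by labels with `.loc`. For comparison, here is a minimal sketch of the same kind of assignment with `.iloc`, which addresses the cell by integer positions instead. The table `df` is made up for the illustration.

```python
import pandas as pd

# Hypothetical miniature table, not part of the lesson data
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [10, 20, 30]},
    index=["a", "b", "c"],
)

df.loc["b", "y"] = 99    # by labels: row "b", column "y"
df.iloc[0, 1] = -1       # by positions: first row, second column (counting from 0)

print(df)
```

In both cases the order of the two keys is (row, column).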


**⚠ Warning:** Much like with a dictionary, but perhaps somewhat unintuitively, it is possible to write a value into a row and a column that do not exist!

In [8]:
planety_bad = planety.copy()     # Make a copy, just to be safe

# "Zeme" (missing diacritics) and "planeta" are deliberate typos
planety_bad.loc["Zeme", "planeta"] = True
planety_bad

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta,planeta
Merkur,☿,0.39,0.24,0.0,True,
Venuše,♀,0.72,0.62,0.0,True,
Země,⊕,1.0,1.0,1.0,True,
Mars,♂,1.52,1.88,2.0,True,
Jupiter,♃,5.2,11.86,79.0,True,
Saturn,♄,9.54,29.46,82.0,True,
Uran,♅,19.22,84.01,27.0,True,
Neptun,♆,30.06,164.8,14.0,True,
Pluto,♇,39.48,247.94,5.0,False,
Zeme,,,,,,True


It is also possible to assign to ranges of the index. You just need to make sure that the assigned value or values are either a scalar or have the same shape as the area being assigned to:

In [9]:
planety.loc["Merkur":"Mars", "je_obr"] = False
planety.loc["Jupiter":"Neptun", "je_obr"] = [True, True, True, True]
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_planeta,je_obr
Merkur,☿,0.39,0.24,0,True,False
Venuše,♀,0.72,0.62,0,True,False
Země,⊕,1.0,1.0,1,True,False
Mars,♂,1.52,1.88,2,True,False
Jupiter,♃,5.2,11.86,79,True,True
Saturn,♄,9.54,29.46,82,True,True
Uran,♅,19.22,84.01,27,True,True
Neptun,♆,30.06,164.8,14,True,True
Pluto,♇,39.48,247.94,5,False,
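Two details of range assignment are easy to miss: a label slice in `.loc` includes its end point (so `"Merkur":"Mars"` covers Mars as well), and a list of the wrong length is refused. A minimal sketch on a made-up table follows; the names `df` and `shape_error` are illustrative.

```python
import pandas as pd

# Hypothetical miniature table, not part of the lesson data
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

df.loc["a":"b", "value"] = 0           # a scalar is spread over both rows "a" and "b"
df.loc["c":"d", "value"] = [30, 40]    # a list must have one item per selected row

shape_error = None
try:
    df.loc["a":"b", "value"] = [1, 2, 3]   # three values for two rows
except ValueError as error:
    shape_error = error
    print("Assignment refused:", error)

print(df)
```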




**Task:** By coincidence (or is it an astronomical inevitability?), all the giant planets have at least some kind of ring. Can you create the `"ma_prstenec"` column in a simple way?

### Removing a column

The `drop` method removes a column or a row from a `DataFrame`. Its first argument expects the label (index) of one or more rows or columns that you want to remove. The `axis` argument says in which dimension the operation should be applied (0 or 1). The number is intuitive and matches the order in which keys are given when referring to cells.

Axis (`axis`):
- 0 = rows
- 1 = columns

(Many other methods and functions use this argument too, so make sure you understand it.)
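As a quick check of how `axis` works, here is a minimal sketch on a made-up table (the names `df`, `without_row` and `without_column` are only illustrative). It also uses the `columns=` keyword of `drop`, which some people find easier to read than `axis=1`.

```python
import pandas as pd

# Hypothetical miniature table, not part of the lesson data
df = pd.DataFrame(
    {"a": [1, 2], "b": [10, 20], "c": [100, 200]},
    index=["x", "y"],
)

without_row = df.drop("x", axis=0)       # removes the row labelled "x"
without_column = df.drop("a", axis=1)    # removes the column labelled "a"

print(without_row.shape)       # one row fewer than df
print(without_column.shape)    # one column fewer than df

# The same column removal, written with a keyword instead of axis=1
same_result = df.drop(columns="a")
print(same_result.equals(without_column))
```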

In [10]:
# Remove a needless column whose information value is on the level of "wipers wipe, the horn honks"
planety = planety.drop("je_planeta", axis=1)
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_obr
Merkur,☿,0.39,0.24,0,False
Venuše,♀,0.72,0.62,0,False
Země,⊕,1.0,1.0,1,False
Mars,♂,1.52,1.88,2,False
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True
Pluto,♇,39.48,247.94,5,


The `drop` method, in line with the convention mentioned above, returns a new `DataFrame` (which is why we have to assign the result of the operation to `planety`). If you want to operate directly on the table, you can use the `del` statement (it works the same as for a dictionary) or ask the panda gods (and the authors of these materials) for forgiveness and add the argument `inplace=True`:

In [11]:
# Alternative 1)
# del planety["je_planeta"]

# Alternative 2)
# planety.drop("je_planeta", axis=1, inplace=True)

### Removing a row

We go back to the future (or rather the present) and deal with Pluto mercilessly.

Again we use the `drop` method with the right value of the `axis` argument, that is, 0. Luckily for us, this is the default value, so we can leave the argument out entirely:

In [12]:
planety = planety.drop("Pluto")   # Add axis=0 if you want to be explicit
planety

jmeno,symbol,obezna_poloosa,obezna_doba,mesice,je_obr
Merkur,☿,0.39,0.24,0,False
Venuše,♀,0.72,0.62,0,False
Země,⊕,1.0,1.0,1,False
Mars,♂,1.52,1.88,2,False
Jupiter,♃,5.2,11.86,79,True
Saturn,♄,9.54,29.46,82,True
Uran,♅,19.22,84.01,27,True
Neptun,♆,30.06,164.8,14,True


## Data types

As we hinted earlier, data types in pandas differ a little from the types in Python, but fortunately the conversion between them is mostly automatic and "behaves as expected".
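A small sketch of what "automatic" means here: plain Python numbers go into a `Series`, pandas stores them in a numpy-based type, and what you read back is a numpy scalar that still behaves like a number. The variable names are illustrative only.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])    # plain Python ints go in...
print(numbers.dtype)              # ...and are stored as a numpy integer type

first = numbers.iloc[0]
print(type(first))                # a numpy scalar rather than a Python int
print(first + 1)                  # yet arithmetic works as you would expect

as_python = int(first)            # explicit conversion back, if you ever need it
print(type(as_python))

mixed = pd.Series([1, 2.5])       # ints mixed with floats are stored as floats
print(mixed.dtype)
```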

#### Preparing the data

In this data course we will use various data sets (usually larger ones, where it is not practical to write them out in full in the constructor). We now leave the planets and look at some interesting characteristics of countries around the world (since the definition of what a country is happens to be somewhat vague, we consider the UN member states), captured for one particular year of the past decade (because not all figures are always available, we take the last year for which enough indicators are known). The data mostly come from the [Gapminder](https://www.gapminder.org/) project; we only added a few more pieces of information from Wikipedia.


In [13]:
url = "https://raw.githubusercontent.com/janpipek/data-pro-pyladies/master/data/countries.csv"
countries = pd.read_csv(url, index_col="name")   # Instead of `set_index`
countries

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Afghanistan,AFG,south_asia,asia,low_income,False,False,,2018,652860.0,34500000.0,0.03,20.62,21.07,,2090.0,66.3,58.69,65.812,63.101,1946-11-19
Albania,ALB,europe_central_asia,europe,upper_middle_income,False,False,,2018,28750.0,3238000.0,7.29,26.45,25.66,5.978,3193.0,12.5,78.01,80.737,76.693,1955-12-14
Algeria,DZA,middle_east_north_africa,africa,upper_middle_income,False,False,,2018,2381740.0,36980000.0,0.69,24.60,26.37,,3296.0,21.9,77.86,77.784,75.279,1962-10-08
Andorra,AND,europe_central_asia,europe,high_income,False,False,,2017,470.0,88910.0,10.17,27.63,26.43,,,2.1,82.55,,,1993-07-28
Angola,AGO,sub_saharan_africa,africa,upper_middle_income,False,False,,2018,1246700.0,20710000.0,5.57,22.25,23.48,,2473.0,96.0,65.19,64.939,59.213,1976-12-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,VEN,america,americas,upper_middle_income,False,False,,2018,912050.0,30340000.0,7.60,27.45,28.13,7.332,2631.0,12.9,75.91,79.079,70.950,1945-11-15
Vietnam,VNM,east_asia_pacific,asia,lower_middle_income,False,False,,2018,330967.0,90660000.0,3.91,20.92,21.07,,2745.0,17.3,74.88,81.203,72.003,1977-09-20
Yemen,YEM,middle_east_north_africa,asia,lower_middle_income,False,False,,2018,527970.0,26360000.0,0.20,24.44,26.11,,2223.0,33.8,67.14,66.871,63.875,1947-09-30
Zambia,ZMB,sub_saharan_africa,africa,lower_middle_income,False,False,,2018,752610.0,14310000.0,3.56,20.68,23.05,11.260,1930.0,43.3,59.45,65.362,59.845,1964-12-01


Let us pick some random\* country and see what data the table holds about it.

\**We are giving luck a little nudge, but admit that 42 is a number as good as or better than any other.*

In [14]:
import numpy; numpy.random.seed(42)  # "Steering" the randomness

row = countries.sample(1).iloc[0]
row

# Alternative for the less trusting:
#  countries.loc["Czechia"]

iso                                             CZE
world_6region                   europe_central_asia
world_4region                                europe
income_groups                           high_income
is_eu                                          True
is_oecd                                        True
eu_accession                             2004-05-01
year                                           2018
area                                          78870
population                                1.059e+07
alcohol_adults                                16.47
bmi_men                                       27.91
bmi_women                                     26.51
car_deaths_per_100000_people                   5.72
calories_per_day                               3256
infant_mortality                                2.8
life_expectancy                               79.37
life_expectancy_female                       81.858
life_expectancy_male                         76.148
un_accession

Even at first glance, the fields differ in type. But which types are they? The `dtypes` attribute of our table will tell us (remember: for a `Series` we used `dtype`).

In [15]:
countries.dtypes

iso                              object
world_6region                    object
world_4region                    object
income_groups                    object
is_eu                              bool
is_oecd                            bool
eu_accession                     object
year                              int64
area                            float64
population                      float64
alcohol_adults                  float64
bmi_men                         float64
bmi_women                       float64
car_deaths_per_100000_people    float64
calories_per_day                float64
infant_mortality                float64
life_expectancy                 float64
life_expectancy_female          float64
life_expectancy_male            float64
un_accession                     object
dtype: object

When pandas reads tables, it tries to automatically recognize numbers (including their kind) and logical values. All the remaining columns it leaves up to you and, somewhat defeatistically, treats them as "objects". However, these are not all the types (not even all the common ones). Besides, you know `float` and `int` from Python, but why is the number `64` part of the name? Let us take it from the beginning.
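You can watch this type detection on a tiny example without downloading anything. The sketch below reads a made-up CSV from a string (the column names and values are invented for the illustration). It also uses the `parse_dates` parameter of `read_csv`, which is one way of asking pandas to treat a column as dates.

```python
import io

import pandas as pd

# Hypothetical three-line CSV, made up for this illustration
csv_text = """name,n_items,ratio,flag,joined
a,1,0.5,True,2004-05-01
b,2,1.5,False,
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="name")
print(sample.dtypes)    # numbers and booleans are recognized, the date stays as text

# Dates are only parsed when you ask for it
parsed = pd.read_csv(io.StringIO(csv_text), index_col="name", parse_dates=["joined"])
print(parsed.dtypes)
```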

## Column types

Types in pandas derive from the way they are defined by the `numpy` library (generally useful for working with numerical arrays and providing vector operations that are orders of magnitude faster than in plain Python). numpy primarily needs to know how to allocate an array for elements of a given type, so that they can be laid out efficiently one after another, and therefore also how many bytes of memory each one takes. In doing so it mirrors the "native" data types that you may know, if you already have such experience, from the C language, for instance. Memory layout is something we usually do not deal with in Python, but fast computation cannot do without it. We will not go into detail, but the demand for speed will surface now and then, and we will stress that operations should be done "in a vectorized way", handled "at the numpy level".
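The following sketch illustrates both points on a small numpy array: every element of a given type takes a fixed number of bytes (which is where the `64` in `int64` comes from: 64 bits, i.e. 8 bytes), and an operation can be applied to the whole array at once instead of element by element. The variable names are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4], dtype="int64")
print(values.dtype, values.itemsize)   # every element takes 8 bytes, i.e. 64 bits
print(values.nbytes)                   # 4 elements * 8 bytes

small = values.astype("int8")          # the same numbers stored in 1 byte each
print(small.itemsize)

# "Vectorized": one operation on the whole array, carried out inside numpy
doubled = values * 2

# The same result computed element by element in a Python loop
looped = np.array([v * 2 for v in values])

print(doubled, looped)
```

On four numbers the two approaches take about the same time; the difference shows up on large columns.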

The somewhat cryptic numpy system (described in the [documentation](https://docs.scipy.org/doc/numpy/user/basics.types.html)) is fortunately simplified in pandas, which works with just a few basic types (or families of types).

**integer types**

**floating-point numbers**

**objects**

**categorical**

**date / time**

**logical**
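To get a feel for these families, the sketch below builds one tiny `Series` for each of them and prints its `dtype`. The values are made up, and the exact type names you see may differ slightly between pandas versions.

```python
import pandas as pd

# One small, made-up Series per family of types
examples = {
    "integer": pd.Series([1, 2, 3]),
    "floating point": pd.Series([1.5, 2.0]),
    "object": pd.Series(["text", 42, None]),    # arbitrary Python objects
    "categorical": pd.Series(["low", "high", "low"], dtype="category"),
    "date / time": pd.to_datetime(pd.Series(["2004-05-01", "1995-01-01"])),
    "logical": pd.Series([True, False]),
}

for family, series in examples.items():
    print(f"{family:15} {series.dtype}")
```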

In [16]:
countries.dtypes

iso                              object
world_6region                    object
world_4region                    object
income_groups                    object
is_eu                              bool
is_oecd                            bool
eu_accession                     object
year                              int64
area                            float64
population                      float64
alcohol_adults                  float64
bmi_men                         float64
bmi_women                       float64
car_deaths_per_100000_people    float64
calories_per_day                float64
infant_mortality                float64
life_expectancy                 float64
life_expectancy_female          float64
life_expectancy_male            float64
un_accession                     object
dtype: object

In [17]:
countries.world_6region.astype("category")

name
Afghanistan                  south_asia
Albania             europe_central_asia
Algeria        middle_east_north_africa
Andorra             europe_central_asia
Angola               sub_saharan_africa
                         ...           
Venezuela                       america
Vietnam               east_asia_pacific
Yemen          middle_east_north_africa
Zambia               sub_saharan_africa
Zimbabwe             sub_saharan_africa
Name: world_6region, Length: 193, dtype: category
Categories (6, object): [america, east_asia_pacific, europe_central_asia, middle_east_north_africa, south_asia, sub_saharan_africa]

In [18]:
countries["population"] = countries["population"].astype("int")
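The conversion above works because the `population` column evidently has no missing values. A float column that does contain a missing value cannot be converted to a plain integer type; the sketch below (with made-up values) shows the error and two possible ways around it. Also note that a bare `"int"` leaves the number of bits to the platform: the outputs later in this lesson show `dtype: int32`, which suggests they were produced where `"int"` meant 32 bits. Writing `"int64"` explicitly avoids that ambiguity.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, None])    # a float column with one missing value

conversion_error = None
try:
    values.astype("int64")              # plain integers cannot represent "missing"
except (ValueError, TypeError) as error:
    conversion_error = error
    print("Cannot convert:", error)

# Option 1: the nullable integer type of pandas (note the capital "I")
nullable = values.astype("Int64")
print(nullable)

# Option 2: decide on a replacement value first, then convert
filled = values.fillna(0).astype("int64")
print(filled)
```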

## Math

In [19]:
countries["population"] / countries["area"]

name
Afghanistan     52.844408
Albania        112.626087
Algeria         15.526464
Andorra        189.170213
Angola          16.611855
                  ...    
Venezuela       33.265720
Vietnam        273.924591
Yemen           49.927079
Zambia          19.013832
Zimbabwe        34.113011
Length: 193, dtype: float64

In [20]:
countries["population"].sum(), countries["area"].sum()

(7085561598, 133823385.5)

In [21]:
from datetime import datetime
datetime.now() - pd.to_datetime(countries["eu_accession"]).dropna()

name
Austria           9121 days 13:35:50.205568
Belgium          24623 days 13:35:50.205568
Bulgaria          4738 days 13:35:50.205568
Croatia           2546 days 13:35:50.205568
Cyprus            5713 days 13:35:50.205568
Czechia           5713 days 13:35:50.205568
Denmark          17156 days 13:35:50.205568
Estonia           5713 days 13:35:50.205568
Finland           9121 days 13:35:50.205568
France           24623 days 13:35:50.205568
Germany          24623 days 13:35:50.205568
Greece           14234 days 13:35:50.205568
Hungary           5713 days 13:35:50.205568
Ireland          17156 days 13:35:50.205568
Italy            24623 days 13:35:50.205568
Latvia            5713 days 13:35:50.205568
Lithuania         5713 days 13:35:50.205568
Luxembourg       24623 days 13:35:50.205568
Malta             5713 days 13:35:50.205568
Netherlands      24623 days 13:35:50.205568
Poland            5713 days 13:35:50.205568
Portugal         12408 days 13:35:50.205568
Romania           4738 days

## Filtering

So far we have

In [22]:
countries["is_eu"].value_counts()

False    165
True      28
Name: is_eu, dtype: int64

In [23]:
countries[countries["is_eu"]]

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000,12.4,26.47,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000,10.41,26.76,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27
Bulgaria,BGR,europe_central_asia,europe,upper_middle_income,True,False,2007-01-01,2018,111000.0,7349000,11.4,26.54,25.52,9.662,2829.0,9.3,75.32,78.485,71.618,1955-12-14
Croatia,HRV,europe_central_asia,europe,high_income,True,False,2013-01-01,2018,56590.0,4379000,15.0,26.6,25.18,6.434,3059.0,3.6,77.66,81.167,74.701,1992-05-22
Cyprus,CYP,europe_central_asia,europe,high_income,True,False,2004-05-01,2018,9250.0,1141000,8.84,27.42,25.93,6.419,2649.0,2.5,80.79,82.918,78.734,1960-09-20
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000,16.47,27.91,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19
Denmark,DNK,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,42922.0,5611000,12.02,26.13,25.11,3.481,3367.0,2.9,81.1,82.878,79.13,1945-10-24
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000,17.24,26.26,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17
Finland,FIN,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,338420.0,5419000,13.1,26.73,25.58,3.615,3368.0,1.9,82.06,84.423,78.934,1955-12-14
France,FRA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,549087.0,63780000,12.48,25.85,24.83,2.491,3482.0,3.5,82.62,85.747,79.991,1945-10-24


In [24]:
countries.query("is_oecd")

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Australia,AUS,east_asia_pacific,asia,high_income,False,True,,2018,7741220.0,23210000,10.21,27.56,26.88,5.335,3276.0,3.0,82.87,85.102,81.39,1945-11-01
Austria,AUT,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,83879.0,8441000,12.4,26.47,25.09,3.541,3768.0,2.9,81.84,84.249,79.585,1955-12-14
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000,10.41,26.76,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27
Canada,CAN,america,americas,high_income,False,True,,2018,9984670.0,34990000,10.2,27.45,26.7,6.333,3494.0,4.3,82.16,84.509,80.859,1945-11-09
Chile,CHL,america,americas,high_income,False,True,,2018,756096.0,17570000,8.81,27.02,27.93,3.329,2979.0,7.0,80.66,82.272,77.416,1945-10-24
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000,16.47,27.91,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19
Denmark,DNK,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,42922.0,5611000,12.02,26.13,25.11,3.481,3367.0,2.9,81.1,82.878,79.13,1945-10-24
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000,17.24,26.26,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17
Finland,FIN,europe_central_asia,europe,high_income,True,True,1995-01-01,2018,338420.0,5419000,13.1,26.73,25.58,3.615,3368.0,1.9,82.06,84.423,78.934,1955-12-14
France,FRA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,549087.0,63780000,12.48,25.85,24.83,2.491,3482.0,3.5,82.62,85.747,79.991,1945-10-24
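The two cells above filter in two different ways: indexing with a boolean `Series` (a "mask") and the `query` method. Here is a minimal sketch of both on a made-up miniature stand-in for the `countries` table (`mini` is a hypothetical name), this time with two conditions combined.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `countries` table
mini = pd.DataFrame(
    {
        "is_eu": [True, True, False],
        "is_oecd": [True, False, True],
        "population": [10_590_000, 1_141_000, 23_210_000],
    },
    index=["Czechia", "Cyprus", "Australia"],
)

# A mask is a Series of True/False values, one per row;
# `&` combines two masks element by element (mind the parentheses)
mask = mini["is_eu"] & (mini["population"] > 5_000_000)
print(mask)

by_mask = mini[mask]                                         # keeps rows where the mask is True
by_query = mini.query("is_eu and population > 5000000")      # the same condition as text

print(by_mask)
print(by_mask.equals(by_query))
```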


## Sorting

In the introductory `pandas` lesson we already showed how to sort rows by the index using the `sort_index` method.
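Sorting by values works much the same way, only with the `sort_values` method. A minimal sketch of the difference on a made-up `Series` (the name `s` is illustrative):

```python
import pandas as pd

# Hypothetical three-item Series, made up for this illustration
s = pd.Series({"b": 3, "c": 1, "a": 2})

print(s.sort_index())                    # ordered by label: a, b, c
print(s.sort_values())                   # ordered by value: c, a, b
print(s.sort_values(ascending=False))    # largest value first: b, a, c

# Both methods return a new object; `s` keeps its original order
print(s)
```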

In [25]:
countries["population"].sort_values()

name
Tuvalu                 9888
Nauru                 10440
Palau                 20920
San Marino            32160
Monaco                35460
                    ...    
Brazil            200100000
Indonesia         247200000
United States     318500000
India            1275000000
China            1359000000
Name: population, Length: 193, dtype: int32

In [26]:
countries["area"].sort_values(ascending=False)

name
Russia           17098250.0
Canada            9984670.0
United States     9831510.0
China             9562911.0
Brazil            8515770.0
                    ...    
Liechtenstein         160.0
San Marino             60.0
Tuvalu                 30.0
Nauru                  20.0
Monaco                  2.0
Name: area, Length: 193, dtype: float64

In [27]:
countries.sort_values("alcohol_adults", ascending=False).head(10)

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Moldova,MDA,europe_central_asia,europe,lower_middle_income,False,False,,2018,33850.0,3496000,23.01,24.24,27.06,5.529,2714.0,13.6,72.41,76.09,67.544,1992-03-02
South Korea,KOR,east_asia_pacific,asia,high_income,False,True,,2018,100280.0,48770000,19.15,23.99,23.33,4.319,3334.0,2.9,81.35,85.467,79.456,1991-09-17
Belarus,BLR,europe_central_asia,europe,upper_middle_income,False,False,,2018,207600.0,9498000,18.85,26.16,26.64,8.454,3250.0,3.4,73.76,78.583,67.693,1945-10-24
North Korea,PRK,east_asia_pacific,asia,low_income,False,False,,2018,120540.0,24650000,18.28,22.02,21.25,,2094.0,19.7,71.13,75.512,68.45,1991-09-17
Ukraine,UKR,europe_central_asia,europe,lower_middle_income,False,False,,2018,603550.0,44700000,17.47,25.42,26.23,8.771,3138.0,7.7,72.29,77.067,67.246,1945-10-24
Estonia,EST,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,45230.0,1339000,17.24,26.26,25.19,5.896,3253.0,2.3,77.66,82.111,73.201,1991-09-17
Czechia,CZE,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,78870.0,10590000,16.47,27.91,26.51,5.72,3256.0,2.8,79.37,81.858,76.148,1993-01-19
Uganda,UGA,sub_saharan_africa,africa,low_income,False,False,,2018,241550.0,36760000,16.4,22.36,22.48,13.69,2130.0,37.7,62.86,62.667,58.252,1962-10-25
Lithuania,LTU,europe_central_asia,europe,high_income,True,True,2004-05-01,2018,65286.0,3278000,16.3,26.86,26.01,8.09,3417.0,3.3,75.31,80.06,69.554,1991-09-17
Russia,RUS,europe_central_asia,europe,high_income,False,False,,2018,17098250.0,142600000,16.23,26.01,27.21,14.38,3361.0,8.2,71.07,76.882,65.771,1945-10-24


In [28]:
countries[countries["is_eu"]].sort_values(["eu_accession", "population"])

name,iso,world_6region,world_4region,income_groups,is_eu,is_oecd,eu_accession,year,area,population,alcohol_adults,bmi_men,bmi_women,car_deaths_per_100000_people,calories_per_day,infant_mortality,life_expectancy,life_expectancy_female,life_expectancy_male,un_accession
Luxembourg,LUX,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,2590.0,530000,12.84,27.43,26.09,5.971,3539.0,1.5,82.39,84.227,79.981,1945-10-24
Belgium,BEL,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,30530.0,10820000,10.41,26.76,25.14,5.427,3733.0,3.3,81.23,83.751,79.131,1945-12-27
Netherlands,NLD,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,41540.0,16760000,9.75,26.02,25.47,2.237,3228.0,3.2,81.92,83.841,80.44,1945-12-10
Italy,ITA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,301340.0,61090000,9.72,26.48,24.79,3.778,3579.0,2.9,82.62,85.435,81.146,1955-12-14
France,FRA,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,549087.0,63780000,12.48,25.85,24.83,2.491,3482.0,3.5,82.62,85.747,79.991,1945-10-24
Germany,DEU,europe_central_asia,europe,high_income,True,True,1952-07-23,2018,357380.0,81800000,12.14,27.17,25.74,3.28,3499.0,3.1,81.25,83.632,79.06,1973-09-18
Ireland,IRL,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,70280.0,4631000,14.92,27.65,26.62,3.768,3600.0,3.0,81.49,83.737,79.885,1955-12-14
Denmark,DNK,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,42922.0,5611000,12.02,26.13,25.11,3.481,3367.0,2.9,81.1,82.878,79.13,1945-10-24
United Kingdom,GBR,europe_central_asia,europe,high_income,True,True,1973-01-01,2018,243610.0,63180000,13.24,27.39,26.94,3.377,3424.0,3.5,81.19,83.558,80.127,1945-10-24
Greece,GRC,europe_central_asia,europe,high_income,True,True,1981-01-01,2018,131960.0,11450000,11.01,26.34,24.92,9.175,3400.0,3.6,81.34,84.071,79.129,1945-10-25


In [29]:
countries.assign(density=countries["population"] / countries["area"]).sort_values("density", ascending=False)[["population", "area", "density"]]

name,population,area,density
Monaco,35460,2.0,17730.000000
Singapore,5301000,719.0,7372.739917
Bahrain,1377000,771.0,1785.992218
Malta,420600,320.0,1314.375000
Maldives,328600,300.0,1095.333333
...,...,...,...
Suriname,538900,163820.0,3.289586
Iceland,332000,103000.0,3.223301
Australia,23210000,7741220.0,2.998235
Namibia,2404000,824290.0,2.916449


In [30]:
countries.sort_index(axis=1)

name,alcohol_adults,area,bmi_men,bmi_women,calories_per_day,car_deaths_per_100000_people,eu_accession,income_groups,infant_mortality,is_eu,is_oecd,iso,life_expectancy,life_expectancy_female,life_expectancy_male,population,un_accession,world_4region,world_6region,year
Afghanistan,0.03,652860.0,20.62,21.07,2090.0,,,low_income,66.3,False,False,AFG,58.69,65.812,63.101,34500000,1946-11-19,asia,south_asia,2018
Albania,7.29,28750.0,26.45,25.66,3193.0,5.978,,upper_middle_income,12.5,False,False,ALB,78.01,80.737,76.693,3238000,1955-12-14,europe,europe_central_asia,2018
Algeria,0.69,2381740.0,24.60,26.37,3296.0,,,upper_middle_income,21.9,False,False,DZA,77.86,77.784,75.279,36980000,1962-10-08,africa,middle_east_north_africa,2018
Andorra,10.17,470.0,27.63,26.43,,,,high_income,2.1,False,False,AND,82.55,,,88910,1993-07-28,europe,europe_central_asia,2017
Angola,5.57,1246700.0,22.25,23.48,2473.0,,,upper_middle_income,96.0,False,False,AGO,65.19,64.939,59.213,20710000,1976-12-01,africa,sub_saharan_africa,2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,7.60,912050.0,27.45,28.13,2631.0,7.332,,upper_middle_income,12.9,False,False,VEN,75.91,79.079,70.950,30340000,1945-11-15,americas,america,2018
Vietnam,3.91,330967.0,20.92,21.07,2745.0,,,lower_middle_income,17.3,False,False,VNM,74.88,81.203,72.003,90660000,1977-09-20,asia,east_asia_pacific,2018
Yemen,0.20,527970.0,24.44,26.11,2223.0,,,lower_middle_income,33.8,False,False,YEM,67.14,66.871,63.875,26360000,1947-09-30,asia,middle_east_north_africa,2018
Zambia,3.56,752610.0,20.68,23.05,1930.0,11.260,,lower_middle_income,43.3,False,False,ZMB,59.45,65.362,59.845,14310000,1964-12-01,africa,sub_saharan_africa,2018


**Task:** Which countries have a problem with overweight (the average of the BMI of men and women is over 25)?

In [31]:
bmi = (countries["bmi_men"] + countries["bmi_women"]) / 2
bmi[bmi > 25].sort_values(ascending=False)

name
Nauru               34.460
Tonga               32.630
Samoa               32.040
Palau               31.115
Marshall Islands    30.380
                     ...  
Kyrgyzstan          25.215
Switzerland         25.135
Malaysia            25.090
Guyana              25.075
Gabon               25.015
Length: 120, dtype: float64

**Task:** In which 20 countries of the world do the most people die in car accidents?

In [32]:
(countries["population"] * countries["car_deaths_per_100000_people"] / 100000).dropna().astype("int").sort_values(ascending=False).head(20)

name
China                               48788
India                               38683
United States                       30330
Russia                              20505
Iran                                20424
Ethiopia                            16832
Mexico                              11124
Democratic Republic of the Congo    10970
Egypt                                9135
Kenya                                8252
Tanzania                             8134
Bangladesh                           6795
Myanmar                              5894
Turkey                               5417
Niger                                5313
Uganda                               5032
Morocco                              5021
South Africa                         4540
Cameroon                             4278
Sudan                                4237
dtype: int32