# Data analysis with pandas

[Pandas quick-start guide](http://pandas.pydata.org/pandas-docs/stable/10min.html)  
[Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)  
[Lecture notes on pandas](../predavanja/Analiza podatkov s knjižnico Pandas.ipynb)


### Loading pandas and the data

In [5]:
# load the packages
import pandas as pd
import numpy as np

import os.path


# since we will be working with large tables, always display at most 10 rows
pd.options.display.max_rows = 10

# choose the interactive "notebook" plotting style
%matplotlib notebook

In [19]:
# load the table we will be working with
pot = os.path.join(
    '../../', '02-zajem-podatkov', 
    'predavanja', 'obdelani-podatki', 'filmi.csv'
)

filmi = pd.read_csv(pot, index_col='id')
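
The effect of `index_col` can be seen on a tiny in-memory file. This is only a sketch: the CSV text below is a made-up two-row sample (with values borrowed from this dataset), not the real `filmi.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for filmi.csv
csv_text = """id,naslov,leto,ocena
4972,The Birth of a Nation,1915,6.3
12349,The Kid,1921,8.3
"""

# index_col='id' turns the 'id' column into the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col='id')

print(sample.index.name)            # name of the index
print(sample.loc[12349, 'naslov'])  # look a row up by its id
```

With the `id` column as the index, rows can be addressed by film id through `.loc` instead of by position.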

Let's take a look at the data.

In [20]:
filmi

id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis
4972,The Birth of a Nation,195,1915,6.3,,22363,10000000.0,,The Stoneman family finds its friendship with ...
6864,Intolerance: Love's Struggle Throughout the Ages,163,1916,7.7,99.0,13970,2180000.0,,"The story of a poor young woman, separated by ..."
9968,Broken Blossoms or The Yellow Man and the Girl,90,1919,7.3,,9296,,,"A frail waif, abused by her brutal boxer fathe..."
10323,Das Cabinet des Dr. Caligari,76,1920,8.1,,56089,,,"Hypnotist Dr. Caligari uses a somnambulist, Ce..."
12349,The Kid,68,1921,8.3,,110278,5450000.0,,"The Tramp cares for an abandoned child, but ev..."
...,...,...,...,...,...,...,...,...,...
11390036,A Fall from Grace,115,2020,5.8,34.0,10414,,,"Disheartened since her ex-husband's affair, Gr..."
11905962,Sputnik,113,2020,6.3,61.0,8285,,,The lone survivor of an enigmatic spaceship in...
12393526,Bulbbul,94,2020,6.6,,8381,,,A man returns home after years to find his bro...
12567088,Raat Akeli Hai,149,2020,7.3,,12232,,,The film follows a small town cop who is summo...


## Exploring the data

Sort the data by rating.

In [23]:
filmi.sort_values('ocena', ascending=False)

id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis
252487,Hababam Sinifi,87,1975,9.3,,36468,,,"Lazy, uneducated students share a very close b..."
111161,Kaznilnica odrešitve,142,1994,9.3,80.0,2293163,28341469.0,R,Two imprisoned men bond over a number of years...
68646,Boter,175,1972,9.2,100.0,1582906,134966411.0,,The aging patriarch of an organized crime dyna...
5354160,Aynabaji,147,2016,9.1,,21429,,,Ayna is an actor and the prison is his stage. ...
7738784,Peranbu,147,2018,9.0,,11866,,,"A single father tries to raise his daughter, w..."
...,...,...,...,...,...,...,...,...,...
5988370,Reis,108,2017,1.4,,72207,,,A drama about the early life of Recep Tayyip E...
6038600,Smolensk,120,2016,1.4,,7630,,,Inspired by true events of 2010 Polish Air For...
4009460,Saving Christmas,79,2014,1.4,18.0,14855,2783970.0,PG,His annual Christmas party faltering thanks to...
7886848,Sadak 2,133,2020,1.1,,57957,,,"The film picks up where Sadak left off, revolv..."
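
A small sketch of how `sort_values` behaves, using a hypothetical three-row table with values borrowed from the output above. It illustrates that sorting returns a new table and that several sort keys can be combined.

```python
import pandas as pd

# Hypothetical three-row sample, not the full dataset
sample = pd.DataFrame(
    {
        'naslov': ['The Kid', 'Sputnik', 'Boter'],
        'leto': [1921, 2020, 1972],
        'ocena': [8.3, 6.3, 9.2],
    },
    index=pd.Index([12349, 11905962, 68646], name='id'),
)

# Highest rating first; the original table keeps its order
by_rating = sample.sort_values('ocena', ascending=False)

# Several keys: oldest year first, ties broken by higher rating
by_year_then_rating = sample.sort_values(
    ['leto', 'ocena'], ascending=[True, False]
)

print(by_rating['naslov'].tolist())
```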


Select the rating column.

In [30]:
filmi['ocena']

id
4972        6.3
6864        7.7
9968        7.3
10323       8.1
12349       8.3
           ... 
11390036    5.8
11905962    6.3
12393526    6.6
12567088    7.3
12749596    6.6
Name: ocena, Length: 10000, dtype: float64

In [31]:
filmi[['naslov', 'ocena']]

id,naslov,ocena
4972,The Birth of a Nation,6.3
6864,Intolerance: Love's Struggle Throughout the Ages,7.7
9968,Broken Blossoms or The Yellow Man and the Girl,7.3
10323,Das Cabinet des Dr. Caligari,8.1
12349,The Kid,8.3
...,...,...
11390036,A Fall from Grace,5.8
11905962,Sputnik,6.3
12393526,Bulbbul,6.6
12567088,Raat Akeli Hai,7.3


The expressions `filmi['ocena']` and `filmi[['ocena']]` are different:

In [32]:
print(type(filmi['ocena']))
print(type(filmi[['ocena']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


The columns of a `DataFrame` are of type `Series`. Single brackets select a `Series`, while double brackets select a `DataFrame` subtable. Most operations (grouping, joining, plotting, filtering, ...) work on a `DataFrame`.

The `Series` type is used when, for example, we want to add a column.

Round the rating column with the `round()` function.

In [37]:
ocene_zaokrozene = round(filmi['ocena'])
ocene_zaokrozene

id
4972        6.0
6864        8.0
9968        7.0
10323       8.0
12349       8.0
           ... 
11390036    6.0
11905962    6.0
12393526    7.0
12567088    7.0
12749596    7.0
Name: ocena, Length: 10000, dtype: float64
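
A short sketch of what the rounding does, on hypothetical sample values. The built-in `round()` applied to a `Series` gives the same result as the `.round()` method, and values exactly halfway between two integers are rounded to the nearest even integer.

```python
import pandas as pd

# Hypothetical ratings, including two exact halves
ratings = pd.Series([6.3, 7.7, 6.5, 7.5])

rounded_builtin = round(ratings)  # built-in function, as used above
rounded_method = ratings.round()  # equivalent Series method

print(rounded_method.tolist())
```

Here `6.5` becomes `6.0` while `7.5` becomes `8.0`, which can be surprising if you expect halves to always round up.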

Add the rounded values to the data.

In [39]:
filmi['ocena_lepse'] = ocene_zaokrozene
filmi

id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis,ocena_lepse
4972,The Birth of a Nation,195,1915,6.3,,22363,10000000.0,,The Stoneman family finds its friendship with ...,6.0
6864,Intolerance: Love's Struggle Throughout the Ages,163,1916,7.7,99.0,13970,2180000.0,,"The story of a poor young woman, separated by ...",8.0
9968,Broken Blossoms or The Yellow Man and the Girl,90,1919,7.3,,9296,,,"A frail waif, abused by her brutal boxer fathe...",7.0
10323,Das Cabinet des Dr. Caligari,76,1920,8.1,,56089,,,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.0
12349,The Kid,68,1921,8.3,,110278,5450000.0,,"The Tramp cares for an abandoned child, but ev...",8.0
...,...,...,...,...,...,...,...,...,...,...
11390036,A Fall from Grace,115,2020,5.8,34.0,10414,,,"Disheartened since her ex-husband's affair, Gr...",6.0
11905962,Sputnik,113,2020,6.3,61.0,8285,,,The lone survivor of an enigmatic spaceship in...,6.0
12393526,Bulbbul,94,2020,6.6,,8381,,,A man returns home after years to find his bro...,7.0
12567088,Raat Akeli Hai,149,2020,7.3,,12232,,,The film follows a small town cop who is summo...,7.0


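Remove a column with the `.drop()` method by passing the `columns=` argument. Note that the cell below drops the original `ocena` column rather than the newly added `ocena_lepse`, so the following sections work with the rounded ratings only.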

In [46]:
filmi = filmi.drop(columns = 'ocena')
filmi

id,naslov,dolzina,leto,metascore,glasovi,zasluzek,oznaka,opis,ocena_lepse
4972,The Birth of a Nation,195,1915,,22363,10000000.0,,The Stoneman family finds its friendship with ...,6.0
6864,Intolerance: Love's Struggle Throughout the Ages,163,1916,99.0,13970,2180000.0,,"The story of a poor young woman, separated by ...",8.0
9968,Broken Blossoms or The Yellow Man and the Girl,90,1919,,9296,,,"A frail waif, abused by her brutal boxer fathe...",7.0
10323,Das Cabinet des Dr. Caligari,76,1920,,56089,,,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.0
12349,The Kid,68,1921,,110278,5450000.0,,"The Tramp cares for an abandoned child, but ev...",8.0
...,...,...,...,...,...,...,...,...,...
11390036,A Fall from Grace,115,2020,34.0,10414,,,"Disheartened since her ex-husband's affair, Gr...",6.0
11905962,Sputnik,113,2020,61.0,8285,,,The lone survivor of an enigmatic spaceship in...,6.0
12393526,Bulbbul,94,2020,,8381,,,A man returns home after years to find his bro...,7.0
12567088,Raat Akeli Hai,149,2020,,12232,,,The film follows a small town cop who is summo...,7.0
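
A minimal sketch of `.drop(columns=...)` on a hypothetical two-row table. It shows that the method returns a new table and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

# Hypothetical two-row sample with both rating columns
sample = pd.DataFrame(
    {
        'naslov': ['The Birth of a Nation', 'The Kid'],
        'ocena': [6.3, 8.3],
        'ocena_lepse': [6.0, 8.0],
    }
)

# Drop a single column; 'sample' itself is not modified
without_rounded = sample.drop(columns='ocena_lepse')

# A list drops several columns at once
titles_only = sample.drop(columns=['ocena', 'ocena_lepse'])

print(without_rounded.columns.tolist())
print(sample.columns.tolist())
```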


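### Note: slices
Selecting a subtable produces a so-called slice.
A slice is not guaranteed to be an independent copy of the table; it may only refer to the original table,
so assigning into it is unreliable, and pandas typically warns with `SettingWithCopyWarning`.
If we want a table that can be modified safely, we call the `.copy()` method on the slice and modify the copy.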
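
The following sketch, on a hypothetical two-row table, shows that a subtable obtained with `.copy()` can be changed freely without affecting the table it came from.

```python
import pandas as pd

# Hypothetical two-row sample
sample = pd.DataFrame(
    {
        'naslov': ['The Kid', 'Sputnik'],
        'leto': [1921, 2020],
        'glasovi': [110278, 8285],
    }
)

# Explicit copy of the selected columns
subset = sample[['naslov', 'leto']].copy()

# Changing the copy leaves the original untouched
subset.loc[0, 'leto'] = 1900

print(subset.loc[0, 'leto'])
print(sample.loc[0, 'leto'])
```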


Select a subtable with the columns `naslov`, `leto`, and `glasovi`, to which you then add the column of rounded ratings.

In [52]:
zanimivo = (filmi[['naslov', 'leto', 'glasovi', 'ocena_lepse']]).copy()
zanimivo['ocena_lepse'] = zanimivo['ocena_lepse'].astype(int)
zanimivo

id,naslov,leto,glasovi,ocena_lepse
4972,The Birth of a Nation,1915,22363,6
6864,Intolerance: Love's Struggle Throughout the Ages,1916,13970,8
9968,Broken Blossoms or The Yellow Man and the Girl,1919,9296,7
10323,Das Cabinet des Dr. Caligari,1920,56089,8
12349,The Kid,1921,110278,8
...,...,...,...,...
11390036,A Fall from Grace,2020,10414,6
11905962,Sputnik,2020,8285,6
12393526,Bulbbul,2020,8381,7
12567088,Raat Akeli Hai,2020,12232,7


### Filtering

Create a filter that selects films released before 1930 and a filter for films released after 2017.
Combine them to select the films released before 1930 or after 2017.

In [53]:
prej = filmi['leto'] < 1930
potem = filmi['leto'] > 2017
izbrani = filmi[prej | potem]
izbrani

id,naslov,dolzina,leto,metascore,glasovi,zasluzek,oznaka,opis,ocena_lepse
4972,The Birth of a Nation,195,1915,,22363,10000000.0,,The Stoneman family finds its friendship with ...,6.0
6864,Intolerance: Love's Struggle Throughout the Ages,163,1916,99.0,13970,2180000.0,,"The story of a poor young woman, separated by ...",8.0
9968,Broken Blossoms or The Yellow Man and the Girl,90,1919,,9296,,,"A frail waif, abused by her brutal boxer fathe...",7.0
10323,Das Cabinet des Dr. Caligari,76,1920,,56089,,,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.0
12349,The Kid,68,1921,,110278,5450000.0,,"The Tramp cares for an abandoned child, but ev...",8.0
...,...,...,...,...,...,...,...,...,...
11390036,A Fall from Grace,115,2020,34.0,10414,,,"Disheartened since her ex-husband's affair, Gr...",6.0
11905962,Sputnik,113,2020,61.0,8285,,,The lone survivor of an enigmatic spaceship in...,6.0
12393526,Bulbbul,94,2020,,8381,,,A man returns home after years to find his bro...,7.0
12567088,Raat Akeli Hai,149,2020,,12232,,,The film follows a small town cop who is summo...,7.0
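
A filter is simply a `Series` of booleans, so filters can be combined with `|` (or), `&` (and) and `~` (not). The sketch below uses a hypothetical three-row table. When the comparisons are written inline, each one needs its own parentheses because `|` and `&` bind more tightly than `<` and `>`.

```python
import pandas as pd

# Hypothetical three-row sample
sample = pd.DataFrame(
    {
        'naslov': ['The Kid', 'Boter', 'Sputnik'],
        'leto': [1921, 1972, 2020],
    }
)

early = sample['leto'] < 1930
late = sample['leto'] > 2017

either = sample[early | late]      # before 1930 or after 2017
neither = sample[~(early | late)]  # everything else

# The same selection written inline; the parentheses are required
inline = sample[(sample['leto'] < 1930) | (sample['leto'] > 2017)]

print(either['naslov'].tolist())
```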


Define a function that checks whether a string contains more than two words. Then use `.apply()` to select all films whose titles are longer than two words and whose rounded rating is above 8.

In [56]:
dobra_ocena = filmi['ocena_lepse'] > 8

def daljsi_od_dveh_besed(naslov):
    return len(naslov.split()) > 2

dolga_imena = filmi['naslov'].apply(daljsi_od_dveh_besed)

dobri_filmi = filmi[dobra_ocena & dolga_imena]
dobri_filmi

id,naslov,dolzina,leto,metascore,glasovi,zasluzek,oznaka,opis,ocena_lepse
50083,12 jeznih mož,96,1957,96.0,673688,4360000.0,,A jury holdout attempts to prevent a miscarria...,9.0
59550,Operatsiya 'Y' i drugie priklyucheniya Shurika,95,1965,,11693,,,Three stories about Shurik - a young student. ...,9.0
60196,"Dober, grd, hudoben",161,1966,90.0,676502,6100000.0,,A bounty hunting scam joins two men in an unea...,9.0
71562,"Boter, II. del",202,1974,90.0,1105879,57300000.0,,The early life and career of Vito Corleone in ...,9.0
73486,Let nad kukavičjim gnezdom,133,1975,83.0,899748,112000000.0,,A criminal pleads insanity and is admitted to ...,9.0
...,...,...,...,...,...,...,...,...,...
253614,Saban Oglu Saban,90,1977,,15852,,,Husamettin the commander in the army is consta...,9.0
263975,Selvi Boylum Al Yazmalim,90,1977,,14502,,,Story of a dilemma between a woman's love and ...,9.0
271383,O Auto da Compadecida,104,2000,,11452,,,João Grilo and Chicó are two very poor and cle...,9.0
317248,Cidade de Deus,130,2002,79.0,689090,7563397.0,R,"In the slums of Rio, two kids' paths diverge a...",9.0
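
As a possible alternative to `.apply()`, the same filter can be sketched with the vectorized `.str` accessor. The titles below are a hypothetical sample taken from the outputs in this notebook.

```python
import pandas as pd

# Hypothetical sample of titles
titles = pd.Series(['The Kid', 'Boter', 'Cidade de Deus', '12 jeznih mož'])


def daljsi_od_dveh_besed(naslov):
    # True when the title has more than two whitespace-separated words
    return len(naslov.split()) > 2


# Element-wise application of a Python function
with_apply = titles.apply(daljsi_od_dveh_besed)

# Same test using string methods on the whole column
with_str = titles.str.split().str.len() > 2

print(with_str.tolist())
```

Both variants yield the same boolean filter; `.apply()` remains the more general tool when the test cannot be expressed with the built-in string methods.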


### Histograms

Group the films by rating and count them.

In [58]:
filmi.groupby('ocena_lepse').size()

ocena_lepse
1.0       5
2.0      40
3.0      56
4.0     221
5.0     829
6.0    3220
7.0    3534
8.0    2029
9.0      66
dtype: int64
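
For a single column, the same counts can also be obtained with `value_counts()`. The sketch below compares both approaches on a hypothetical five-row table. `value_counts()` orders the result by frequency, so `sort_index()` is used to order it by rating instead.

```python
import pandas as pd

# Hypothetical sample of rounded ratings
sample = pd.DataFrame({'ocena_lepse': [6.0, 8.0, 7.0, 8.0, 8.0]})

# Number of rows in each group
by_groupby = sample.groupby('ocena_lepse').size()

# Same counts for a single column, reordered by rating
by_value_counts = sample['ocena_lepse'].value_counts().sort_index()

print(by_groupby.tolist())
```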

Draw a bar chart of this data.

In [59]:
filmi.groupby('ocena_lepse').size().plot.bar()

<AxesSubplot:xlabel='ocena_lepse'>

Tables have a `.hist()` method that builds histograms of their columns. Use this method to display the simplified data.

In [67]:
filmi.groupby('ocena_lepse').hist()

ocena_lepse
1.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
2.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
3.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
4.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
5.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
6.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
7.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
8.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
9.0    [[AxesSubplot(0.125,0.666111;0.336957x0.213889...
dtype: object
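
Calling `.hist()` on a grouped table produces a separate set of histograms for every group, which is why the output above lists one entry per rounded rating. As a rough sketch of what a single histogram over one column counts, `numpy.histogram` computes bin counts without drawing anything. The sample values and bin edges below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of rounded ratings
rounded = pd.Series([6.0, 8.0, 7.0, 8.0, 8.0], name='ocena_lepse')

# Assumed bin edges 0.5, 1.5, ..., 9.5: one bin centred on each rating 1..9
edges = np.arange(0.5, 10.5, 1.0)

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, _ = np.histogram(rounded, bins=edges)

print(counts.tolist())
```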

### Plotting the average film length by year

In [71]:
# select the column before aggregating, so text columns are not averaged
filmi.groupby('leto')[['dolzina']].mean().plot()

<AxesSubplot:xlabel='leto'>
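
A minimal sketch of the aggregation behind this plot, on a hypothetical three-row table with placeholder titles. Selecting the column before calling `.mean()` restricts the computation to that column. This matters because recent pandas versions, as far as I know, raise an error when asked to average text columns such as `naslov`.

```python
import pandas as pd

# Hypothetical sample; the titles are placeholders
sample = pd.DataFrame(
    {
        'naslov': ['A', 'B', 'C'],
        'leto': [2019, 2019, 2020],
        'dolzina': [100, 120, 90],
    }
)

# Average length per year, computed for the 'dolzina' column only
mean_length = sample.groupby('leto')['dolzina'].mean()

print(mean_length.to_dict())
```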

### Plotting the total earnings per year

In [73]:
# select the column before aggregating, so only the earnings are summed
filmi.groupby('leto')[['zasluzek']].sum().plot()

<AxesSubplot:xlabel='leto'>
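
Many films in the table have no recorded earnings, which affects how this plot should be read. By default, `.sum()` skips missing values, so a year with no known earnings appears as `0` rather than as missing. The sketch below, on a hypothetical three-row table, shows the default behaviour next to `min_count=1`, which keeps such years as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1919 has no known earnings at all
sample = pd.DataFrame(
    {
        'leto': [1919, 1919, 1921],
        'zasluzek': [np.nan, np.nan, 5450000.0],
    }
)

# Default: missing values are skipped, an all-missing year sums to 0
total_default = sample.groupby('leto')['zasluzek'].sum()

# min_count=1: at least one known value is required, otherwise NaN
total_strict = sample.groupby('leto')['zasluzek'].sum(min_count=1)

print(total_default.to_dict())
print(total_strict.to_dict())
```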