# Data Integration

<!--
guides on using pandas, matplotlib and numpy for data processing. They are not about how to do exploratory analysis, but about how to use the libraries

I will pick the interesting parts from these and combine them with a book chapter on how to handle data processing, cleaning and transformation
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_10_pandas_introduction.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_11_pandas_adding_data.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_12_pandas_groupby.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_13_pandas_movies.ipynb 
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_14_pandas_reshape.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_15_pandas_transforming.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_21_pandas_processing.ipynb
http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_22_pandas_cleaning.ipynb

http://nbviewer.jupyter.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_23_titanic_example.ipynb
-->

In [None]:
from IPython.display import Image
Image('img/ML_Workflow.PNG')

# What I am not going to cover

* I will not describe every possible record linkage and entity mapping method (that would need at least a lecture of its own)
* I will not describe complex ETL tools and procedures for joining tables and different databases (we even have a separate course for that)

# Contents of today's presentation

* Intro to using the Pandas, Matplotlib and NumPy libraries
* How to use these libraries for basic reshaping of data (data cleaning, reshaping, wrangling)
* The very basics of exploratory analysis and of working with missing values

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

### What do we need Pandas for?
* importing data from standard formats
* cleaning it
* looking into the data (statistics, sampling, basic plots)
* passing the data on to analysis / model training

### What is Pandas?
* The Python community took inspiration from, and stole the good parts of, the `data.frame` structure in R and similar structures in Matlab or Octave
* It provides basic data operations: sampling, group by, merge, ...
* It is built on top of the NumPy array

### Basic tasks
* Handling missing values (.dropna(), pd.isnull())
* Merge, join (concat, join)
* Group
* Reshaping data (pivoting) (stack, pivot)
* Working with time series (resampling, timezones, ..)
* Plotting

## A bit about NumPy

In [None]:
pole = [1,2,3]
pole * 3

In [None]:
np_pole = np.array([1,2,3])
np_pole * 3
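
The two cells above look alike but do different things: multiplying a Python list repeats it, while multiplying a NumPy array multiplies every element. A minimal sketch of the contrast (the variable names are only illustrative):

```python
import numpy as np

values = [1, 2, 3]
repeated = values * 3              # list repetition: the list is concatenated with itself
scaled = np.array(values) * 3      # element-wise multiplication on the array

print(repeated)
print(scaled)
```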

In [None]:
x = np.arange(20).reshape(4, 5) # try more dimensions
x

In [None]:
x.shape

In [None]:
x.ndim

In [None]:
x.sum(axis=1)
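
The `axis` argument names the dimension that is collapsed by the reduction. A small sketch on the same 4 x 5 array (the names `grid`, `column_totals` and `row_totals` are only illustrative):

```python
import numpy as np

grid = np.arange(20).reshape(4, 5)   # 4 rows, 5 columns

column_totals = grid.sum(axis=0)     # collapses the rows: one total per column
row_totals = grid.sum(axis=1)        # collapses the columns: one total per row

print(column_totals)  # [30 34 38 42 46]
print(row_totals)     # [10 35 60 85]
```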

## Several numeric types

In [None]:
x.dtype

In [None]:
a = np.array([.1,.2])
print(a)
a.dtype

In [None]:
c = np.array( [ [1,2], [3,4] ], dtype=complex )
print(c)
c.dtype
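
Every array has exactly one `dtype`, and mixing types in an operation promotes the result. A short sketch of promotion and explicit conversion (illustrative names; the exact integer width depends on the platform):

```python
import numpy as np

ints = np.arange(3)                    # integer dtype, width depends on the platform
mixed = ints + 0.5                     # integer array + float is promoted to float64
as_float32 = ints.astype(np.float32)   # explicit conversion returns a new array

print(ints.dtype, mixed.dtype, as_float32.dtype)
```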

In [None]:
np.zeros((3,4))

In [None]:
np.ones((2,5))

In [None]:
np.repeat(3, 10).reshape([2,5])

In [None]:
np.linspace(0, 2, 9)

In [None]:
x = np.linspace( 0, 2*np.pi, 100 )
f = np.sin(x)

In [None]:
plt.plot(f)

## Matrix operations

In [None]:
A = np.array( [[1,1], [0,1]] )
B = np.array( [[2,0], [3,4]] )

In [None]:
A

In [None]:
B

In [None]:
np.transpose(B)

In [None]:
A*B

In [None]:
A.dot(B) # np.dot(A, B)
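
Note the difference between the last two cells: `*` multiplies element by element, whereas `dot` computes the matrix product. For 2-D arrays the `@` operator gives the same result as `dot`. A sketch with the same two matrices:

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

elementwise = A * B      # each entry is A[i, j] * B[i, j]
matrix_product = A @ B   # rows of A combined with columns of B, same as A.dot(B)

print(elementwise)       # [[2 0] [0 4]]
print(matrix_product)    # [[5 4] [3 4]]
```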

## Selecting elements

In [None]:
a = np.arange(10)**3
a

In [None]:
a[2]

In [None]:
a[2:5]

In [None]:
a[2:6:2]

In [None]:
a[:6:2] = -1000
a

In [None]:
a[ : :-1]

## Selecting elements from a multidimensional array

In [None]:
b = np.arange(20).reshape(4,5)
b

In [None]:
b[2,3]

In [None]:
b[2,]

In [None]:
b[1:3,2:4]

In [None]:
b[:,2:4]
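
One thing worth keeping in mind with the slices above: basic slicing returns a view on the original data, not a copy, so assigning through a slice changes the source array. A minimal sketch (the names `grid`, `window` and `independent` are only illustrative):

```python
import numpy as np

grid = np.arange(20).reshape(4, 5)

window = grid[1:3, 2:4]          # basic slicing returns a view on the same memory
window[:] = 0                    # writing through the view changes grid as well

independent = grid[1:3, 2:4].copy()
independent[:] = 99              # the copy is detached, so grid is not affected

print(grid)
```

If the original has to stay untouched, calling `.copy()` on the slice first is the usual way out.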

For more operations, see
* here https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
* and here https://docs.scipy.org/doc/numpy-dev/reference/index.html

## Some Pandas examples

Pandas uses the NumPy array and builds the `Series` and `DataFrame` types on top of it

In [None]:
s = pd.Series([0,1,2,3,4])
s

In [None]:
# an explicit index is added to the numpy array
s.index

In [None]:
s.values

In [None]:
s[0]

In [None]:
# unlike numpy, though, the index can also be something other than a number
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2

In [None]:
s2['c']

In [None]:
s2.iloc[2] # positional access; recent pandas versions warn about plain s2[2] on a label index

In [None]:
s2.c

In [None]:
# a dictionary (associative array) can also be used to create a Series object
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 'United Kingdom': 64.9, 'Netherlands': 16.9})
population

In [None]:
population['France']

In [None]:
# since it is built on NumPy, we can do all the interesting operations
population * 1000

In [None]:
# the index has an implicit order, so we can take a range
population['Belgium':'Netherlands']
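
Label ranges behave differently from positional ranges: a label slice includes both endpoints, while a positional slice excludes the end, as in NumPy. A small sketch with the same numbers (the index order is written out explicitly here so that the example does not depend on how the dictionary is ordered):

```python
import pandas as pd

population = pd.Series([81.3, 11.3, 64.3, 64.9, 16.9],
                       index=['Germany', 'Belgium', 'France',
                              'United Kingdom', 'Netherlands'])

by_label = population.loc['Belgium':'France']   # label slice: both endpoints included
by_position = population.iloc[1:2]              # positional slice: the end is excluded

print(by_label)
print(by_position)
```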

In [None]:
population.mean()

Elements can be accessed the way we are used to from R

In [None]:
population[['France', 'Netherlands']]

In [None]:
population > 20

In [None]:
population[population > 20]

And a `DataFrame` is essentially a multidimensional `Series`

In [None]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

In [None]:
countries.index

In [None]:
countries.columns

In [None]:
countries.values

In [None]:
countries.dtypes

In [None]:
countries.info()

In [None]:
countries.describe()

In [None]:
categorical = countries.dtypes[countries.dtypes == "object"].index
print(categorical)

countries[categorical].describe()

In [None]:
countries = countries.set_index('country')
countries

and now we can access the individual columns very easily

In [None]:
countries.area # countries['area']

In [None]:
countries['population']*1000000 / countries['area'] # population density

In [None]:
# we can easily create a new column
countries['density'] = countries['population']*1000000 / countries['area']
countries

In [None]:
# and, for example, select rows based on it
countries[countries['density'] > 300]

In [None]:
# we can then, for example, sort
countries.sort_values(by='density', ascending=False)

In [None]:
# a very powerful feature is straightforward plotting
countries.population.plot()
# countries.density.plot(kind='bar')
# countries.plot()

In [None]:
countries.plot(kind='scatter', x='population', y='area')

Because a `DataFrame` adds the option of selecting columns by name, selecting elements has become a bit more complicated than in NumPy. We have to distinguish between
* selecting by name and
* selecting by position.


In [None]:
countries['area']

In [None]:
countries[['area', 'density']]

In [None]:
# but when we ask for a range, it accesses rows
countries['France':'Netherlands']

For more advanced selection from a table we use:
* `loc` and
* `iloc`
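
As a compact reference, `loc` addresses rows and columns by label (or by a boolean mask), while `iloc` addresses them by integer position. A minimal sketch on a hypothetical three-row table (the name `frame` and its contents are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'area': [30510, 671308, 357050],
                      'capital': ['Brussels', 'Paris', 'Berlin']},
                     index=['Belgium', 'France', 'Germany'])

by_label = frame.loc['France', 'capital']                 # row label, column label
by_position = frame.iloc[1, 1]                            # row position, column position
large = frame.loc[frame['area'] > 100000, ['capital']]    # boolean mask plus a list of columns

print(by_label, by_position)
print(large)
```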

In [None]:
# access to a specific cell by row and column
countries.loc['Germany', 'area']

In [None]:
# ranges can be used here on both dimensions as well
countries.loc['France':'Germany', :]

In [None]:
# as well as an explicit list of columns (here combined with a boolean condition on the rows)
countries.loc[countries['density']>300, ['capital', 'population']]

In [None]:
# iloc selects by position. This is similar to accessing elements in NumPy
countries.iloc[0:2,1:3]

In [None]:
# of course, values can still be assigned
countries.loc['Belgium':'Germany', 'population'] = 10
countries

## Reshaping data with Pandas

In [None]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
# df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4), 'D':range(4)})
df

`unstack` takes the values of one index level (originally a column) and turns them into column names

This often comes in handy when the data is in a slightly different shape than the one we need

In [None]:
Image("img/stack.png")

In [None]:
df = df.set_index(['A', 'B']) # first we choose the columns that will serve as the index.
# The second one will supply the values for the names of the new columns
df

In [None]:
# now we say which column holds the values and let pandas rearrange it
result = df['C'].unstack()
result

In [None]:
# the opposite transformation is stack. It takes the column names and turns them into values
df = result.stack().reset_index(name='C')
df

In [None]:
# pivot is very similar to unstack, but lets you set the column names, and there can be more of them
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df

In [None]:
df.pivot(index='A', columns='B', values='C')

In [None]:
# pivot_table is similar to pivot, but it can handle duplicate index/column combinations and lets you define the aggregation function
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df

In [None]:
df.pivot_table(index='A', columns='B', values='C', aggfunc='sum') # aggfunc defaults to the mean
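
To make the difference concrete: with the duplicated combinations used above, plain `pivot` cannot decide which value to keep and raises an error, while `pivot_table` aggregates them. A small sketch (the names `dup`, `pivot_failed` and `summed` are only illustrative):

```python
import pandas as pd

dup = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one', 'two'],
                    'B': ['a', 'b', 'a', 'b', 'a', 'b'],
                    'C': range(6)})

try:
    dup.pivot(index='A', columns='B', values='C')
    pivot_failed = False
except ValueError:
    pivot_failed = True   # ('one', 'a') and ('two', 'b') each occur twice

summed = dup.pivot_table(index='A', columns='B', values='C', aggfunc='sum')
print(pivot_failed)
print(summed)
```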

## OK, let's finally play with some data

In [None]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
data.head()
# The data consists of measurements of some quantity in the individual hours of the day.
# One row per day. Every hour has its own column, plus there is a column for some flag that we do not care about
# there are some odd values that probably should not be there: -999 and -9999
# the date will probably be the index
# the file has no header

In [None]:
filename = "data/BETR8010000800100hour.1-1-1990.31-12-2012"

data = pd.read_csv(filename, sep='\t', header=None,
                   na_values=[-999, -9999], index_col=0)
# a lot of data cleaning can be done right at load time
data.head()
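
The effect of these `read_csv` options can be checked on a tiny in-memory sample. The sketch below assumes a made-up two-day excerpt with only two (value, flag) pairs per day; it is not taken from the real file:

```python
import io
import pandas as pd

# Hypothetical excerpt: date, then (value, flag) pairs; -999 marks a missing measurement
raw = "1990-01-01\t12.0\t1\t-999\t0\n1990-01-02\t30.0\t1\t41.0\t1\n"

sample = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                     na_values=[-999, -9999], index_col=0)

print(sample)
print(sample.isnull().sum())   # number of missing cells per column
```

Because there is no header, the columns are simply numbered, and the sentinel value shows up as `NaN` straight after loading.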

In [None]:
# let's try to throw away the flags we do not care about. As it happens, that is every second column
data = data.drop(data.columns[1::2], axis=1)
data.head()

In [None]:
["{:02d}".format(i) for i in range(len(data.columns))]

In [None]:
# the column names are a bit of a mess
data.columns = ["{:02d}".format(i) for i in range(len(data.columns))]
data.head()

In [None]:
data = data.stack()
data.head()

In [None]:
type(data) # the result of the rearrangement is a Series with a multi-level index, not a DataFrame. I want a nice data frame, so let's do something about it

In [None]:
filename

In [None]:
# we could give the column a sensible name
import os
_, fname = os.path.split(filename)
station = fname[:7]
print(filename)
print(station)

In [None]:
data = data.reset_index(name=station) # reset_index turns it into a data frame
# data = data.reset_index() # reset_index turns it into a data frame
print(type(data))
data.head()

In [None]:
data = data.rename(columns = {0:'date', 'level_1':'hour'})
data.head()

In [None]:
# now we build a new index from the date and the hour
data.index = pd.to_datetime(data['date'] + ' ' + data['hour'])
data.head()

In [None]:
# and delete the columns we no longer need
data = data.drop(['date', 'hour'], axis=1)
data.head()
# Now we have data that we can actually work with

I have more of these files. Each one contains data from a different measuring station. To simplify the presentation, I put the preceding code into a loop and moved it into a script
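
The script itself is not shown here. As a rough idea of what such a loop could look like, the sketch below wraps the steps from the previous cells in a function and joins the per-station results with `pd.concat`. The function name `load_station`, the station names and the miniature inputs are all hypothetical, and this is not the actual `airbase` module:

```python
import io
import pandas as pd

def load_station(source, station):
    """Turn one wide hourly file (one row per day) into a one-column, time-indexed DataFrame."""
    wide = pd.read_csv(source, sep='\t', header=None,
                       na_values=[-999, -9999], index_col=0)
    wide = wide.drop(wide.columns[1::2], axis=1)               # drop the flag columns
    wide.columns = ["{:02d}".format(i) for i in range(len(wide.columns))]
    wide.index.name, wide.columns.name = 'date', 'hour'
    tall = wide.stack().dropna().reset_index(name=station)     # one row per (date, hour)
    tall.index = pd.to_datetime(tall['date'] + ' ' + tall['hour'] + ':00')
    return tall.drop(['date', 'hour'], axis=1)

# Hypothetical miniature inputs with only two "hours" per day
files = {
    'STATION_A': "1990-01-01\t10\t1\t-999\t0\n1990-01-02\t30\t1\t40\t1\n",
    'STATION_B': "1990-01-01\t5\t1\t6\t1\n1990-01-02\t7\t1\t8\t1\n",
}
combined = pd.concat([load_station(io.StringIO(text), name)
                      for name, text in files.items()], axis=1)
print(combined)
```

Concatenating along `axis=1` aligns the stations on the shared time index, so a measurement missing at one station simply becomes `NaN` in that station's column.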

In [None]:
import airbase
no2 = airbase.load_data()

In [None]:
no2.head(3)

In [None]:
no2.tail()

In [None]:
no2.info()

In [None]:
no2.describe()

In [None]:
no2.plot(kind='box')

In [None]:
no2['BETN029'].plot(kind='hist', bins=50)

In [None]:
import seaborn

In [None]:
seaborn.violinplot(data=no2)

In [None]:
no2.plot(figsize=(12,6))
# I can plot the raw data, but it is questionable what that tells me

In [None]:
# I can ask for just a smaller part
no2[-500:].plot(figsize=(12,6))

or I can use more interesting time series operations

In [None]:
no2.index # since the index consists of timestamps, I can do interesting things with it

In [None]:
no2["2010-01-01 09:00": "2010-01-01 12:00"] # for example, define ranges using date strings

In [None]:
no2.loc['2012'] # or select all data from one particular year like this
# no2.loc['2012'].head()
# no2.loc['2012/03'] # or only the data from March

In [None]:
# the date components are accessible from the index
# no2.index.hour
no2.index.year

In [None]:
# and, more interestingly, I can change the sampling frequency
no2.resample('D').mean().head()
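
What `resample` does is easiest to see on a series whose values are known in advance. The sketch below uses a hypothetical hourly series covering two days (the values are simply 0 to 47), not the NO2 data:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly measurements covering two full days
hours = pd.Timestamp('2012-01-01') + pd.to_timedelta(np.arange(48), unit='h')
series = pd.Series(np.arange(48, dtype=float), index=hours)

daily_mean = series.resample('D').mean()                # one value per calendar day
daily_stats = series.resample('D').agg(['min', 'max'])  # several aggregates at once

print(daily_mean)    # 11.5 for the first day, 35.5 for the second
print(daily_stats)
```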

In [None]:
no2.resample('M').mean().plot()
# this already seems to tell us a bit more. For example, that there is probably some seasonality

In [None]:
no2.resample('A').mean().plot()
# and maybe some long-term trend too

In [None]:
no2['2012-3':'2012-4'].resample('D').mean().plot()
# maybe there is some weekly seasonality too

In [None]:
# I can also use several aggregation functions and compare them
no2.loc['2009':, 'FR04037'].resample('M').agg(['mean', 'median']).plot()
# no2.loc['2009':, 'FR04037'].resample('M').agg(['mean', 'std']).plot()

## Another common operation is groupby
you surely know it from SQL

In [None]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

In [None]:
df.groupby('key').aggregate('sum') # df.groupby('key').sum()
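
For readers coming from SQL, the cell above corresponds roughly to `SELECT key, SUM(data) FROM df GROUP BY key`. Several aggregates can be computed in one pass with `agg`; a small sketch on the same data (the name `summary` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Roughly: SELECT key, SUM(data), AVG(data) FROM df GROUP BY key
summary = df.groupby('key')['data'].agg(['sum', 'mean'])
print(summary)
```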

In [None]:
no2['month'] = no2.index.month
no2.head()

In [None]:
no2.groupby('month').mean()

In [None]:
no2.groupby('month').mean().plot()

Question: how would you plot the typical daily profile of this value for the different stations?
<!--
no2.groupby(no2.index.hour).mean().plot()
-->

Question: how does the profile of the values differ between a typical weekday and the weekend for station FR04012?
<!--
no2['weekday'] = no2.index.weekday
no2['weekend'] = no2['weekday'].isin([5, 6])
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend_FR04012 = data_weekend['FR04012'].unstack(level=0)
data_weekend_FR04012.plot()
-->

# An example analysis using a different dataset
this time it will not be time series, but Iris, the classic dataset for demonstrating classification

In [None]:
iris_data = pd.read_csv('data/iris-data.csv')
iris_data.head()
# this is a slightly mangled version of the flower dataset

In [None]:
iris_data.info()

In [None]:
iris_data.describe()

In [None]:
categorical = iris_data.dtypes[iris_data.dtypes == "object"].index
print(categorical)

iris_data[categorical].describe()

In [None]:
seaborn.pairplot(iris_data.dropna(), hue='class')

In [None]:
iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'

iris_data['class'].unique()

In [None]:
seaborn.pairplot(iris_data.dropna(), hue='class')

In [None]:
iris_data.loc[iris_data['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist()

In [None]:
plt.rc("lines", markeredgewidth=0.5)
iris_data.loc[iris_data['class'] == 'Iris-versicolor', 'sepal_length_cm'].plot(kind='box')

In [None]:
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1 ), 'sepal_length_cm']

In [None]:
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] > 1 ), 'sepal_length_cm']

In [None]:
mask = (iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1 )

iris_data.loc[mask, 'sepal_length_cm'] = iris_data.loc[mask, 'sepal_length_cm'] * 100

In [None]:
iris_data.loc[mask, 'sepal_length_cm']

In [None]:
seaborn.pairplot(iris_data.dropna(), hue='class')

## Let's also look at the missing values

In [None]:
iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
              (iris_data['sepal_width_cm'].isnull()) |
              (iris_data['petal_length_cm'].isnull()) |
              (iris_data['petal_width_cm'].isnull())]

In [None]:
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].hist()

In [None]:
average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
              (iris_data['petal_width_cm'].isnull()),
              'petal_width_cm'] = average_petal_width

In [None]:
seaborn.pairplot(iris_data, hue='class')

# Summary: what to take away from this exploratory analysis

* Make sure the data is encoded correctly (most often you need to look into the data manually)
* Make sure the data falls within the expected range and everything has the expected format (for example the time format)
* Deal with missing data, for example by dropping it or replacing it with the mean (the mean has to be computed with respect to the class)
* Never edit the data by hand. Always use code that you keep and rerun every time you repeat the experiment. We want the analysis to be reproducible
* Plot everything you possibly can, to confirm visually that things are the way they should be
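
The first two checks can themselves be written as code, which keeps them reproducible. A minimal sketch on a hypothetical miniature of the flower table; the frame `flowers`, the set of expected class labels and the 1 to 10 cm range are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical miniature with the kinds of problems seen in the Iris example
flowers = pd.DataFrame({
    'sepal_length_cm': [5.1, 0.055, 6.3, None],
    'class': ['Iris-setosa', 'Iris-versicolor', 'versicolor', 'Iris-setosa'],
})

expected_classes = {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}  # assumed label set

unknown_labels = set(flowers['class'].unique()) - expected_classes
out_of_range = flowers[(flowers['sepal_length_cm'] < 1) |
                       (flowers['sepal_length_cm'] > 10)]               # assumed plausible range
missing_per_column = flowers.isnull().sum()

print(unknown_labels)
print(out_of_range)
print(missing_per_column)
```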

## SQL v Pandas

In [None]:
from pandasql import sqldf

In [None]:
from pandasql import load_meat, load_births

meat = load_meat()
births = load_births()

In [None]:
type(meat)

In [None]:
meat.head()

In [None]:
births.head()

In [None]:
data = {'meat': meat}

In [None]:
sqldf('select * from meat limit 10', data)

In [None]:
data2 = {'meat2': meat}

In [None]:
sqldf('select * from meat2 limit 10', data2)

In [None]:
sqldf('select * from meat limit 10', locals())

In [None]:
sqldf('select * from births limit 10', locals())

In [None]:
q = """
    SELECT
        m.date
        , b.births
        , m.beef
    FROM
        meat m
    INNER JOIN
        births b
            on m.date = b.date
    ORDER BY
        m.date
    LIMIT 100;
    """

joined = sqldf(q, locals())
print(joined.head())

Pandasql runs on SQLite3, so all the classic SQL operations are available here too. Conditions, nested queries, joins, unions, functions, ... all work
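
For comparison, the inner join from the query above can also be written with plain pandas. The sketch below uses two small made-up tables in place of the real `meat` and `births` data; only the column names `date`, `beef` and `births` are taken from the query:

```python
import pandas as pd

# Hypothetical stand-ins for the meat and births tables (values are made up)
meat_small = pd.DataFrame({'date': ['1975-01-01', '1975-02-01', '1975-03-01'],
                           'beef': [10.0, 20.0, 30.0]})
births_small = pd.DataFrame({'date': ['1975-01-01', '1975-02-01', '1975-04-01'],
                             'births': [100, 200, 300]})

joined_pandas = (meat_small.merge(births_small, on='date', how='inner')  # INNER JOIN ... ON date
                 .sort_values('date')                                    # ORDER BY date
                 [['date', 'births', 'beef']]                            # SELECT columns
                 .head(100))                                             # LIMIT 100
print(joined_pandas)
```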

# A few more useful things when working with a Pandas DataFrame

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python_reference/master/Data/some_soccer_data.csv')
df.head()

In [None]:
# renaming selected columns
df = df.rename(columns={'P': 'points', 
                        'GP': 'games',
                        'SOT': 'shots_on_target',
                        'G': 'goals',
                        'PPG': 'points_per_game',
                        'A': 'assists',})
df.head()

## Transforming values in a column

In [None]:
df['SALARY'] = df['SALARY'].apply(lambda x: x.strip('$m'))
df.head()
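
After stripping the characters the column still holds strings. A possible follow-up, sketched on hypothetical salary strings in the same style (the values are made up), uses the vectorised `.str` accessor and then converts to numbers:

```python
import pandas as pd

# Hypothetical salary strings in the "$<number>m" style
salaries = pd.Series(['$19.2m', '$17.6m', '$22.8m'])

cleaned = salaries.str.strip('$m')     # vectorised equivalent of apply(lambda x: x.strip('$m'))
as_numbers = pd.to_numeric(cleaned)    # strings -> floats, so sorting and arithmetic work

print(as_numbers)
```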

## Adding a column

In [None]:
df['team'] = pd.Series('', index=df.index)
df['position'] = pd.Series('', index=df.index)
df.head()

## Transforming another column and filling in the others

In [None]:
def process_player_col(text):
    name, rest = text.split('\n')
    position, team = [x.strip() for x in rest.split(' — ')]
    return pd.Series([name, team, position])

df[['PLAYER', 'team', 'position']] = df.PLAYER.apply(process_player_col)
df.head()

## Finding out how many rows have missing values

In [None]:
df.shape[0] - df.dropna().shape[0]
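
The expression above counts rows, not columns: it is the number of rows that `dropna()` would remove. The same number, and the column-wise counterpart, can be read from the boolean mask returned by `isnull()`. A sketch on a hypothetical table (the frame `table` is made up):

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({'games': [30, np.nan, 28, 31],
                      'assists': [4, 7, np.nan, np.nan],
                      'team': ['A', 'B', 'C', 'D']})

rows_with_gaps = table.isnull().any(axis=1).sum()      # rows containing at least one NaN
columns_with_gaps = table.isnull().any(axis=0).sum()   # columns containing at least one NaN
gaps_per_column = table.isnull().sum()                 # NaN count for every column

print(rows_with_gaps, columns_with_gaps)
print(gaps_per_column)
```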

## Selecting rows that have missing values

In [None]:
df[df['assists'].isnull()]

## Selecting rows without missing values

In [None]:
df[df['assists'].notnull()]
# df[~df['assists'].isnull()]

## Replacing missing values

In [None]:
# previously we did this by hand:
# iris_data.loc[(iris_data['class'] == 'Iris-setosa') & (iris_data['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width

# There is a nice function for this
df.fillna(value=0, inplace=True)
df

## There is, however, an even more elegant way

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python_reference/master/Data/some_soccer_data.csv')
df = df.rename(columns={'P': 'points', 
                        'GP': 'games',
                        'SOT': 'shots_on_target',
                        'G': 'goals',
                        'PPG': 'points_per_game',
                        'A': 'assists',})
df['SALARY'] = df['SALARY'].apply(lambda x: x.strip('$m'))
df[['PLAYER', 'team', 'position']] = df.PLAYER.apply(process_player_col)            
               
df.head()

In [None]:
from sklearn.impute import SimpleImputer  # the old Imputer class is no longer in sklearn.preprocessing
imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # fills each column with its own mean
df[['games', 'assists']] = imp.fit_transform(df[['games', 'assists']].values)
df.head()

Note that this imputation does not take the class into account

In [None]:
df.games.mean()

In [None]:
df[df.position == 'Forward'].games.mean()
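
One way to make the imputation respect the class is to fill each gap with the mean of its own group. The sketch below does this with `groupby(...).transform('mean')` on a hypothetical table; the frame `players` and its values are made up, and only the column names `position` and `games` follow the data above:

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({'position': ['Forward', 'Forward', 'Midfield', 'Midfield', 'Forward'],
                        'games': [30.0, np.nan, 10.0, np.nan, 20.0]})

# For every row, the mean of 'games' within that row's own position
group_means = players.groupby('position')['games'].transform('mean')
players['games_filled'] = players['games'].fillna(group_means)

print(players)
```

Here the missing Forward value becomes 25 and the missing Midfield value becomes 10, whereas a single global mean would have put 20 in both places.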

## Combining conditions

In [None]:
df[ (df['team'] == 'Arsenal') | (df['team'] == 'Chelsea') ]

In [None]:
df[ (df['team'] == 'Arsenal') & (df['position'] == 'Forward') ]

# Some resources for further study
* http://nbviewer.jupyter.org/format/slides/github/jorisvandenbossche/2015-PyDataParis/blob/master/pandas_introduction.ipynb
* http://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb
* [Pandas Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf), [some commentary on it](http://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html)

# Some other tools
* [OpenRefine](http://openrefine.org/) - a standalone tool for cleaning and inspecting data
* [Trifacta](https://www.trifacta.com/products/wrangler/)