# Intro to Pandas and Dask
### Analyse your data easily with Pandas and scale up with Dask

## Agenda
### Basics of Pandas
* Pandas data structures
* Pandas operations
* Example: CSV file

### Basics of Dask
* Dask data structures
* Dask functions
* Example: read data that does not fit in RAM

## Objectives
### Learn the basics of Pandas
* DataFrame & Series
* Read & Write a CSV
* Explore data (data stats and plot)

### Learn the basics of Dask
* Dask data structures
* Lazy delayed functions
* Dask DataFrame

# Pandas basic data structures

Let's dive straight into an example and then explain the structures.

In [1]:
import pandas as pd
import numpy as np

Let's generate some random data

In [2]:
np.random.seed(493982)
data = np.random.rand(20, 5)
data

array([[0.74654724, 0.99788419, 0.86591949, 0.41695469, 0.8725506 ],
       [0.87068064, 0.66459727, 0.44992557, 0.26986815, 0.25715698],
       [0.11855892, 0.69057913, 0.65974659, 0.12228396, 0.35230689],
       [0.83246491, 0.41042833, 0.83468465, 0.67801725, 0.13328298],
       [0.63107085, 0.11166736, 0.3768215 , 0.5050874 , 0.50397578],
       [0.09142064, 0.41563447, 0.01123665, 0.42177774, 0.53645722],
       [0.8964898 , 0.61909905, 0.64485439, 0.45374511, 0.30419314],
       [0.82762957, 0.60506176, 0.22260624, 0.33932104, 0.30457279],
       [0.55414776, 0.92551271, 0.07917241, 0.22934229, 0.70126891],
       [0.00580176, 0.56441064, 0.71657418, 0.52606646, 0.70951076],
       [0.67113566, 0.94523831, 0.90580676, 0.38760335, 0.11510539],
       [0.85258045, 0.12584943, 0.73929132, 0.92034466, 0.21199683],
       [0.38914323, 0.59220176, 0.62593959, 0.86293186, 0.42646922],
       [0.74097959, 0.88157047, 0.87566306, 0.7656533 , 0.23863027],
       [0.42049234, 0.12083495, 0.

### *The Series*

In [3]:
ser = pd.Series(data[0])
ser

0    0.746547
1    0.997884
2    0.865919
3    0.416955
4    0.872551
dtype: float64
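
A Series pairs a one-dimensional array of values with an index of labels. As a minimal sketch (the values, the labels `a` to `e`, and the names `labelled`, `by_label` and `by_position` are made up for illustration and are not part of the notebook), a Series can be given an explicit index and then read by label or by position:

```python
import numpy as np
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
values = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
labelled = pd.Series(values, index=list("abcde"), name="example")

by_label = labelled.loc["c"]      # label-based access
by_position = labelled.iloc[2]    # position-based access

print(by_label, by_position)
print(labelled.index.tolist())
```

With the default `RangeIndex` used in the cell above, labels and positions happen to coincide, which is why `ser[0]`-style access often hides the distinction.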

### *The DataFrame*

In [4]:
df = pd.DataFrame(data)
df

,0,1,2,3,4
0,0.746547,0.997884,0.865919,0.416955,0.872551
1,0.870681,0.664597,0.449926,0.269868,0.257157
2,0.118559,0.690579,0.659747,0.122284,0.352307
3,0.832465,0.410428,0.834685,0.678017,0.133283
4,0.631071,0.111667,0.376822,0.505087,0.503976
5,0.091421,0.415634,0.011237,0.421778,0.536457
6,0.89649,0.619099,0.644854,0.453745,0.304193
7,0.82763,0.605062,0.222606,0.339321,0.304573
8,0.554148,0.925513,0.079172,0.229342,0.701269
9,0.005802,0.564411,0.716574,0.526066,0.709511
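
A DataFrame can be thought of as a set of Series (the columns) that share one index. The sketch below uses its own small random array and hypothetical column names (`a`, `b`, `c`), not the notebook's `data`, to show a few attributes that are useful when first inspecting a frame:

```python
import numpy as np
import pandas as pd

# Arbitrary seed and shape, unrelated to the notebook's data
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random((4, 3)), columns=["a", "b", "c"])

print(small.shape)    # (number of rows, number of columns)
print(small.dtypes)   # one dtype per column

# Selecting a single column yields a Series sharing the frame's index
column = small["a"]
print(type(column).__name__)
```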


## Pandas Operations

* Accessing DataFrame elements
* Manipulating columns and the index
* Selection

In [5]:
df.head()

,0,1,2,3,4
0,0.746547,0.997884,0.865919,0.416955,0.872551
1,0.870681,0.664597,0.449926,0.269868,0.257157
2,0.118559,0.690579,0.659747,0.122284,0.352307
3,0.832465,0.410428,0.834685,0.678017,0.133283
4,0.631071,0.111667,0.376822,0.505087,0.503976


In [6]:
df.tail(2)

,0,1,2,3,4
18,0.330712,0.3509,0.327098,0.5797,0.613374
19,0.482239,0.815541,0.500151,0.722879,0.41454


In [7]:
df.columns = ['col' + str(x) for x in range(len(df.columns))]
df.head()

,col0,col1,col2,col3,col4
0,0.746547,0.997884,0.865919,0.416955,0.872551
1,0.870681,0.664597,0.449926,0.269868,0.257157
2,0.118559,0.690579,0.659747,0.122284,0.352307
3,0.832465,0.410428,0.834685,0.678017,0.133283
4,0.631071,0.111667,0.376822,0.505087,0.503976


In [8]:
df.index.name = 'my-idx'

In [9]:
df.head()

my-idx,col0,col1,col2,col3,col4
0,0.746547,0.997884,0.865919,0.416955,0.872551
1,0.870681,0.664597,0.449926,0.269868,0.257157
2,0.118559,0.690579,0.659747,0.122284,0.352307
3,0.832465,0.410428,0.834685,0.678017,0.133283
4,0.631071,0.111667,0.376822,0.505087,0.503976


In [10]:
df.index

RangeIndex(start=0, stop=20, step=1, name='my-idx')

In [11]:
df.columns

Index(['col0', 'col1', 'col2', 'col3', 'col4'], dtype='object')

In [12]:
df['col0']

my-idx
0     0.746547
1     0.870681
2     0.118559
3     0.832465
4     0.631071
5     0.091421
6     0.896490
7     0.827630
8     0.554148
9     0.005802
10    0.671136
11    0.852580
12    0.389143
13    0.740980
14    0.420492
15    0.528703
16    0.610622
17    0.258126
18    0.330712
19    0.482239
Name: col0, dtype: float64

In [13]:
df[['col0','col3']].head()

my-idx,col0,col3
0,0.746547,0.416955
1,0.870681,0.269868
2,0.118559,0.122284
3,0.832465,0.678017
4,0.631071,0.505087


In [14]:
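# Note: this overwrites df with the first five rows of col0 and col3
df = df[['col0','col3']].head()
# Keep only the rows where col0 is greater than 0.7, with all columns
df.loc[df['col0'] > 0.7, :]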

my-idx,col0,col3
0,0.746547,0.416955
1,0.870681,0.269868
3,0.832465,0.678017


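#### Explanation
`.loc` selects rows and columns by label or by boolean mask.

`df.loc[df['col0'] > 0.7, :]`

reads like SQL: select all columns `WHERE col0 > 0.7`.

In fact, `df['col0'] > 0.7` is itself a boolean Series, and `.loc` keeps the rows where it is `True`: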

In [15]:
df['col0'] > 0.7

my-idx
0     True
1     True
2    False
3     True
4    False
Name: col0, dtype: bool
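
Boolean masks can also be combined. The sketch below rebuilds the five rows shown above in a separate frame (the names `demo`, `mask` and `selected`, and the `0.4` threshold on `col3`, are assumptions made for illustration) and combines two conditions with `&`. Each condition needs its own parentheses because `&` binds more tightly than `>` in Python:

```python
import pandas as pd

# First five rows of col0 and col3 as printed in the notebook output
demo = pd.DataFrame({
    "col0": [0.746547, 0.870681, 0.118559, 0.832465, 0.631071],
    "col3": [0.416955, 0.269868, 0.122284, 0.678017, 0.505087],
})

# Element-wise AND of two boolean Series; use | for OR and ~ for NOT
mask = (demo["col0"] > 0.7) & (demo["col3"] > 0.4)

# Rows where the mask is True, restricted to a single column
selected = demo.loc[mask, ["col0"]]
print(selected)
```

Using the Python keywords `and` / `or` instead of `&` / `|` raises a `ValueError`, because the truth value of a whole Series is ambiguous.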

## Example: Loading a CSV and analysing the time series
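
The CSV file used for this example is not included here, so the following is only a sketch with made-up data: the column names `date` and `value`, the four rows, and the in-memory `io.StringIO` buffer standing in for a file on disk are all assumptions. It shows the usual steps of reading a CSV with a parsed date index, computing simple statistics, and writing the result back out:

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file path
csv_text = """date,value
2021-01-01,1.0
2021-01-02,2.0
2021-01-03,4.0
2021-01-04,3.0
"""

# Parse the date column and use it as the index
ts = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(ts.describe())                       # summary statistics
mean_value = ts["value"].mean()            # mean over the whole series
rolling = ts["value"].rolling(2).mean()    # 2-row moving average
print(rolling)

# Write the frame back to CSV (here into a buffer instead of a file)
out = io.StringIO()
ts.to_csv(out)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path, and `ts.plot()` (which requires matplotlib) would draw the series.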

## Links
### Create slides using Jupyter Notebook
https://medium.com/learning-machine-learning/present-your-data-science-projects-with-jupyter-slides-75f20735eb0f