# Welcome to 'First steps with pandas'!

After this workshop you can (hopefully) call yourselves Data Scientists!

### Before coding, let's check that we have the right versions of the libraries

You should have:
- Python: 2.7.10
- numpy: 1.11.1
- pandas: 0.18.1
- matplotlib: 1.5.2

In [16]:
# TODO how strict are we imposing certain versions of libraries? If very strict, we can add assert here.

import platform
print "Python:", platform.python_version()

import numpy as np
print 'numpy:', np.__version__

import pandas as pd
print 'pandas:', pd.__version__
 
# `plt` is the conventional alias for matplotlib.pyplot, so use `mpl` for the top-level package
import matplotlib as mpl
print 'matplotlib:', mpl.__version__

Python: 2.7.10
numpy: 1.11.1
pandas: 0.18.1
matplotlib: 1.5.2
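
If we decide to be strict about versions, one option is a small helper that compares the installed version with the expected one. The sketch below is only an illustration: `version_tuple` and `same_minor` are hypothetical names, and comparing only the major and minor components is an assumption about how strict we want to be.

```python
import platform
import re


def version_tuple(version):
    # Turn '0.18.1' (or '1.5.2rc1') into a tuple of its first three numbers
    return tuple(int(n) for n in re.findall(r'\d+', version)[:3])


def same_minor(installed, expected):
    # Hypothetical policy: only the major and minor components have to match
    return version_tuple(installed)[:2] == version_tuple(expected)[:2]


if not same_minor(platform.python_version(), '2.7.10'):
    print("Warning: the workshop was prepared for Python 2.7.10")
```

The same check could be repeated for `np.__version__`, `pd.__version__` and the matplotlib version, or turned into an `assert` if a mismatch should stop the notebook.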


## What is pandas?

> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Why use it?

### We need to be able to develop code quickly and cheaply... and it should be readable

In [10]:
# TODO Make even more impressive example

# Weekly mean views of Michal Pazdan on Wikipedia in Jan '15

# pd.read_json('http://stats.grok.se/json/en/201501/Micha%C5%82_Pazdan') \
pd.read_json('data/cached_wiki_Pazdan.json') \
    .resample('1W') \
    .mean()

| | daily_views | month | rank |
|---|---|---|---|
| 2015-01-04 | 6.5 | 201501.0 | -1.0 |
| 2015-01-11 | 5.428571 | 201501.0 | -1.0 |
| 2015-01-18 | 6.714286 | 201501.0 | -1.0 |
| 2015-01-25 | 5.857143 | 201501.0 | -1.0 |
| 2015-02-01 | 5.833333 | 201501.0 | -1.0 |
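
If you don't have the cached JSON file at hand, the same resample-then-mean idea can be tried on made-up data. This is a minimal sketch: the `daily_views` values are synthetic (just the numbers 1 to 14), not real Wikipedia statistics.

```python
import pandas as pd

# Synthetic daily data: 14 days starting on Thursday, 2015-01-01
days = pd.date_range('2015-01-01', periods=14, freq='D')
daily_views = pd.Series(list(range(1, 15)), index=days)

# Group the days into weeks (each week ends on Sunday) and average each week
weekly_mean = daily_views.resample('1W').mean()
print(weekly_mean)
```

The first week only contains four days (Thursday to Sunday), so its mean is taken over those four values only.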


### We need to develop fast code

In [14]:
# TODO maybe even better example

some_data = list(range(1, 1000000))
some_series = pd.Series(some_data)

def standard_way(data):
    data = [x for x in data if x % 3 == 0]
    return sum(data)


def pandas_way(series):
    return series[(series % 3) == 0].sum()

In [13]:
%timeit standard_way(some_data)

10 loops, best of 3: 142 ms per loop


In [12]:
%timeit pandas_way(some_series)

10 loops, best of 3: 131 ms per loop
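
Before comparing speed, it is worth confirming that both functions really compute the same thing. A minimal sketch on a smaller, made-up range (1 to 999) so that the result is easy to verify by hand:

```python
import pandas as pd


def standard_way(data):
    # Plain Python: keep the multiples of 3, then add them up
    data = [x for x in data if x % 3 == 0]
    return sum(data)


def pandas_way(series):
    # pandas: a boolean mask selects the multiples of 3, sum() adds them up
    return series[(series % 3) == 0].sum()


small_data = list(range(1, 1000))
small_series = pd.Series(small_data)

plain_total = standard_way(small_data)
pandas_total = pandas_way(small_series)
print("plain Python: %d" % plain_total)
print("pandas:       %d" % pandas_total)
```

Both totals should equal 3 + 6 + ... + 999 = 166833.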


### It is hard to write everything from scratch... and it's easy to make mistakes.

http://pandas.pydata.org/pandas-docs/stable/api.html

### It has a very cool name.

![caption](files/pandas.jpg)

## So let's start by learning data structures

### Series

> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

In [20]:
strengths = pd.Series([400, 200, 300, 400, 500])
strengths

0    400
1    200
2    300
3    400
4    500
dtype: int64

In [21]:
names = pd.Series(["Batman", "Robin", "Spiderman", "Robocop", "Terminator"])
names

0        Batman
1         Robin
2     Spiderman
3       Robocop
4    Terminator
dtype: object
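
The "labeled" part of the definition means the index does not have to be 0, 1, 2, ... A small sketch, reusing the hero data from above, where the names become the labels (`strengths_by_name` is just an illustrative variable name):

```python
import pandas as pd

strengths_by_name = pd.Series(
    [400, 200, 300, 400, 500],
    index=["Batman", "Robin", "Spiderman", "Robocop", "Terminator"]
)

# Look up a single value by its label
robin_strength = strengths_by_name["Robin"]

# Filter with a boolean condition; the labels travel with the values
strongest = strengths_by_name[strengths_by_name >= 400]
print(strongest)
```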

### DataFrame

> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

In [24]:
heroes = pd.DataFrame({
        'hero': names,
        'strength': strengths
})
heroes

| | hero | strength |
|---|---|---|
| 0 | Batman | 400 |
| 1 | Robin | 200 |
| 2 | Spiderman | 300 |
| 3 | Robocop | 400 |
| 4 | Terminator | 500 |


In [25]:
other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Konan")
])
other_heroes

| | hero | strength |
|---|---|---|
| 0 | Hercules | 800.0 |
| 1 | Konan | NaN |
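
Two things are worth noticing here: each column of a DataFrame is a Series, and a value missing from the input becomes NaN (which is why `strength` turned into a float column). A minimal sketch of both, rebuilding `other_heroes` so it runs on its own:

```python
import pandas as pd

other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Konan")
])

# Selecting one column gives back a Series
strength_column = other_heroes['strength']

# isnull() marks the missing entries
missing_mask = strength_column.isnull()

# Aggregations such as mean() skip missing values by default
mean_strength = strength_column.mean()
print(mean_strength)
```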


## EXERCISE

Create a DataFrame with the names and heights of your 10 best friends. Calculate the median height of the group.

In [None]:
# TODO here solution
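
One possible solution, as a sketch only: the names and heights below are invented placeholders (replace them with your own friends), and the `height_cm` column name is an assumption.

```python
import pandas as pd

# Placeholder data, not real people
friends = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve',
             'Fay', 'Gus', 'Hal', 'Ida', 'Jon'],
    'height_cm': [160, 165, 170, 172, 175,
                  178, 180, 182, 185, 190]
})

# With 10 values the median is the mean of the two middle ones (175 and 178)
median_height = friends['height_cm'].median()
print(median_height)
```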

In [26]:
# TODO Choose dataset(s) for the exercise

### TODO Skills to include

- [X] know Series - create Series with ... (maybe something related to the workshop audience)
- [X] know DataFrame - create DataFrame in 2 different ways ...
- [ ] read files - read file X (maybe json)
- [ ] filtering of data - filter file in memory with 3 conditions
- [ ] saving data - convert data to list of dicts
- [ ] save data as csv
- [ ] maybe save with to_sql with SQLite
- [ ] calculating new columns, transforming data using map/apply/applymap - create new columns X, Y, Z
- [ ] groupby, aggregation functions, pipeline API - group by X, Y and aggregate by median/last
- [ ] understanding indexes
- [ ] datetime operations - stock market / euro state / beer consumption data?
- [ ] creating simple algorithms using pandas - alerts? hypes? stock market KPIs?
- [ ] delivering working product - wrapping up pandas inside Flask
- [ ] data visualisation - plotting stuff