# Welcome to 'First steps with pandas'!

After this workshop you can (hopefully) call yourselves Data Scientists!

### Before coding, let's check whether we have the proper versions of the libraries

You should have:
- Python: 2.7.10
- numpy: 1.11.1
- pandas: 0.18.1
- matplotlib: 1.5.2

In [1]:
# TODO: How strictly should we enforce library versions? If very strictly, we can add asserts here.

import platform
print "Python:", platform.python_version()

import numpy as np
print 'numpy:', np.__version__

import pandas as pd
print 'pandas:', pd.__version__
 
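# 'plt' conventionally aliases matplotlib.pyplot, so import the top-level package under its own name
import matplotlib
print 'matplotlib:', matplotlib.__version__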

Python: 2.7.10
numpy: 1.11.1
pandas: 0.18.1
matplotlib: 1.5.2
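
If the workshop does decide to enforce versions, one option is a small helper that compares dotted version strings. The sketch below is only an assumption about how such a check could look: `version_tuple` and `meets_minimum` are hypothetical helpers, not part of any library, and they only handle purely numeric versions such as `0.18.1`.

```python
def version_tuple(version):
    """Turn a dotted numeric version string such as '0.18.1' into (0, 18, 1)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, required):
    """Return True when the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


# Versions listed at the top of this notebook.
required_versions = {"numpy": "1.11.1", "pandas": "0.18.1", "matplotlib": "1.5.2"}

# Compare as integer tuples, not as strings: 0.9.0 is older than 0.18.1,
# but a plain string comparison would order them the other way round.
print(meets_minimum("0.18.1", required_versions["pandas"]))
print(meets_minimum("0.9.0", required_versions["pandas"]))
```

Inside the notebook, the check could then be written as `assert meets_minimum(pd.__version__, '0.18.1')`.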


## What is pandas?

> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Why use it?

### We need to be able to develop code quickly and cheaply... and it should be readable

In [2]:
# TODO: Make an even more impressive example

# Weekly mean views of Michal Pazdan on Wikipedia in Jan '15

# pd.read_json('http://stats.grok.se/json/en/201501/Micha%C5%82_Pazdan') \
pd.read_json('data/cached_wiki_Pazdan.json') \
    .resample('1W') \
    .mean()

            daily_views     month  rank
2015-01-04     6.5       201501.0  -1.0
2015-01-11     5.428571  201501.0  -1.0
2015-01-18     6.714286  201501.0  -1.0
2015-01-25     5.857143  201501.0  -1.0
2015-02-01     5.833333  201501.0  -1.0
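
To see what `.resample('1W').mean()` does without the cached file, here is a minimal sketch that rebuilds the first eleven daily view counts by hand (the values are taken from the JSON dump shown later in this notebook). The variable names are illustrative.

```python
import pandas as pd

# First 11 daily view counts for January 2015, starting on 2015-01-01.
daily_views = pd.Series(
    [2, 10, 4, 10, 3, 10, 5, 11, 3, 5, 1],
    index=pd.date_range("2015-01-01", periods=11, freq="D"),
)

# '1W' groups the days into weekly bins that end on Sunday;
# .mean() then averages the values that fall into each bin.
weekly_mean = daily_views.resample("1W").mean()
print(weekly_mean)
```

The first bin only covers 1 to 4 January (Thursday to Sunday), so its mean is (2 + 10 + 4 + 10) / 4 = 6.5, which matches the first row of the output above.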


### We need to develop fast code

In [3]:
# TODO maybe even better example

some_data = list(range(1, 1000000))
some_series = pd.Series(some_data)

def standard_way(data):
    data = [x for x in data if x % 3 == 0]
    return sum(data)


def pandas_way(series):
    return series[(series % 3) == 0].sum()

In [4]:
%timeit standard_way(some_data)

1 loop, best of 3: 174 ms per loop


In [5]:
%timeit pandas_way(some_series)

10 loops, best of 3: 51.3 ms per loop
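
`%timeit` is an IPython magic, so it only works inside the notebook. As a rough sketch of the same comparison in a plain script, the standard-library `timeit` module can be used. The input size and repeat count below are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

small_data = list(range(1, 100000))
small_series = pd.Series(small_data)


def standard_way(data):
    # Pure-Python loop: filter element by element, then sum.
    return sum([x for x in data if x % 3 == 0])


def pandas_way(series):
    # Vectorised: build a boolean mask for the whole Series at once.
    return series[(series % 3) == 0].sum()


# Both approaches must agree before their speed is worth comparing.
assert standard_way(small_data) == pandas_way(small_series)

loop_seconds = timeit.timeit(lambda: standard_way(small_data), number=5)
pandas_seconds = timeit.timeit(lambda: pandas_way(small_series), number=5)
print("list comprehension: %.4f s" % loop_seconds)
print("pandas:             %.4f s" % pandas_seconds)
```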


### It is hard to write everything from scratch... and it's easy to make mistakes.

http://pandas.pydata.org/pandas-docs/stable/api.html

### It handles missing data nicely (and that's a common case)...

In [56]:
missing_data = pd.DataFrame([
    dict(name="Jacek", height=174),
    dict(name="Mateusz", weight=81),
    dict(name="Lionel Messi", height=169, weight=67)
])
missing_data

   height          name  weight
0   174.0         Jacek     NaN
1     NaN       Mateusz    81.0
2   169.0  Lionel Messi    67.0


In [58]:
# numeric_only=True restricts the mean to the numeric columns (height, weight)
missing_data.fillna(missing_data.mean(numeric_only=True))

   height          name  weight
0   174.0         Jacek    74.0
1   171.5       Mateusz    81.0
2   169.0  Lionel Messi    67.0
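
Filling with the column mean is only one option. The sketch below, using the same three rows, shows two other common steps: counting the gaps with `isnull()` and dropping incomplete rows with `dropna()`. Passing `numeric_only=True` to `mean()` is an addition here so that the text column is skipped explicitly.

```python
import pandas as pd

missing_data = pd.DataFrame([
    dict(name="Jacek", height=174),
    dict(name="Mateusz", weight=81),
    dict(name="Lionel Messi", height=169, weight=67),
])

# How many values are missing in each column?
missing_per_column = missing_data.isnull().sum()
print(missing_per_column)

# Keep only the rows that have no missing values at all.
complete_rows = missing_data.dropna()
print(complete_rows)

# Replace each gap with the mean of its (numeric) column.
filled = missing_data.fillna(missing_data.mean(numeric_only=True))
print(filled)
```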


### It has a very cool name.

![caption](files/pandas.jpg)

### Library highlights

http://pandas.pydata.org/#library-highlights

## So let's start by learning the data structures

### Series

> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

In [6]:
strengths = pd.Series([400, 200, 300, 400, 500])
strengths

0    400
1    200
2    300
3    400
4    500
dtype: int64

In [7]:
names = pd.Series(["Batman", "Robin", "Spiderman", "Robocop", "Terminator"])
names

0        Batman
1         Robin
2     Spiderman
3       Robocop
4    Terminator
dtype: object
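
The "labeled" part of the definition means the index does not have to be 0, 1, 2, and so on. A small sketch, assuming we use the hero names as labels (this indexed Series is illustrative and is not used elsewhere in the notebook):

```python
import pandas as pd

# Use hero names as the index instead of the default 0..n-1 positions.
strength_by_hero = pd.Series(
    [400, 200, 300],
    index=["Batman", "Robin", "Spiderman"],
)

# Look up a single value by its label.
print(strength_by_hero["Robin"])

# Boolean mask: keep only the heroes stronger than 250.
strong = strength_by_hero[strength_by_hero > 250]
print(strong)

# Aggregations work on the values and ignore the labels.
print(strength_by_hero.mean())
```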

### DataFrame

> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

In [8]:
heroes = pd.DataFrame({
    'hero': names,
    'strength': strengths
})
heroes

         hero  strength
0      Batman       400
1       Robin       200
2   Spiderman       300
3     Robocop       400
4  Terminator       500


In [9]:
other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Konan")
])
other_heroes

       hero  strength
0  Hercules     800.0
1     Konan       NaN


In [30]:
another_heroes = pd.DataFrame([
    pd.Series(["Bolek", 10], index=["hero", "strength"]),
    pd.Series(["Lolek", 20], index=["hero", "strength"])
])
another_heroes

    hero  strength
0  Bolek        10
1  Lolek        20
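
The "dict of Series" view is worth a quick check: selecting one column gives back a Series, and each column keeps its own dtype. The sketch below also shows why `strength` printed as `800.0` for `other_heroes`: a column containing a missing value is stored as floating point, because `NaN` is a float.

```python
import pandas as pd

heroes = pd.DataFrame({
    "hero": ["Batman", "Robin"],
    "strength": [400, 200],
})

# A single column is a Series.
strength_column = heroes["strength"]
print(type(strength_column))

# Each column has its own dtype (here: one text column, one integer column).
print(heroes.dtypes)

other_heroes = pd.DataFrame([
    dict(hero="Hercules", strength=800),
    dict(hero="Konan"),  # no strength given, so it becomes NaN
])

# The missing value forces the whole column to float64.
print(other_heroes.dtypes)
```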


### EXERCISE

Create the following DataFrame in 3 different ways:

```
                                         movie_title  imdb_score
0                                            Avatar          7.9
1          Pirates of the Caribbean: At World's End          7.1
2                                           Spectre          6.8
```

Help: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-series-or-dicts

TODO: Download all external files and zip them

In [None]:
# TODO solution with dict of series

In [None]:
# TODO solution with list of series

In [10]:
# TODO solution with dict
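
One possible set of solutions is sketched below; it is not the official workshop answer. The third variant reads "solution with dict" as a plain dict of lists, which is an assumption. Passing `columns=` fixes the column order, since older pandas versions sort dict keys alphabetically.

```python
import pandas as pd

titles = ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"]
scores = [7.9, 7.1, 6.8]
columns = ["movie_title", "imdb_score"]

# 1. Dict of Series: each key becomes a column.
from_dict_of_series = pd.DataFrame(
    {"movie_title": pd.Series(titles), "imdb_score": pd.Series(scores)},
    columns=columns,
)

# 2. List of Series: each Series becomes a row, its index holds the column names.
from_list_of_series = pd.DataFrame(
    [pd.Series([title, score], index=columns)
     for title, score in zip(titles, scores)]
)[columns]

# 3. Plain dict of lists.
from_dict_of_lists = pd.DataFrame(
    {"movie_title": titles, "imdb_score": scores},
    columns=columns,
)

print(from_dict_of_series)
```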

## Reading files

In [32]:
# Uncomment the next line and press Tab to list the available readers...
# pd.read_
# e.g. read_sql, read_csv, read_hdf (HDF5)

In [33]:
pd.read_csv?
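
`pd.read_csv?` opens the docstring inside IPython. As a sketch that runs without any file on disk, the readers also accept file-like objects, so a CSV can be parsed straight from a string. The tiny CSV text below is made up for illustration; `usecols` and `nrows` are two of the many documented `read_csv` parameters.

```python
import io

import pandas as pd

# Made-up CSV content; in the workshop this would be a file path instead.
csv_text = u"hero,strength\nBatman,400\nRobin,200\nSpiderman,300\n"

# The first line is used as the header by default.
heroes = pd.read_csv(io.StringIO(csv_text))
print(heroes)

# Read only one column and only the first two data rows.
preview = pd.read_csv(io.StringIO(csv_text), usecols=["hero"], nrows=2)
print(preview)
```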

In [43]:
# TODO change this dataset..
!head -n 50 data/cached_wiki_Pazdan.json

{
    "daily_views": {
        "1420070400000": 2,
        "1420156800000": 10,
        "1420243200000": 4,
        "1420329600000": 10,
        "1420416000000": 3,
        "1420502400000": 10,
        "1420588800000": 5,
        "1420675200000": 11,
        "1420761600000": 3,
        "1420848000000": 5,
        "1420934400000": 1,
        "1421020800000": 4,
        "1421107200000": 20,
        "1421193600000": 3,
        "1421280000000": 7,
        "1421366400000": 4,
        "1421452800000": 5,
        "1421539200000": 4,
        "1421625600000": 6,
        "1421712000000": 1,
        "1421798400000": 5,
        "1421884800000": 3,
        "1421971200000": 9,
        "1422057600000": 11,
        "1422144000000": 6,
        "1422230400000": 8,
        "1422316800000": 5,
        "1422403200000": 6,
        "1422489600000": 8,
        "1422576000000": 2,
        "1422662400000": 6
    },
    "month": {
        "1420070400000": 201501,
        "1420

In [45]:
pd.read_json('data/cached_wiki_Pazdan.json').head()

            daily_views   month project  rank          title
2015-01-01            2  201501      en    -1  Michał_Pazdan
2015-01-02           10  201501      en    -1  Michał_Pazdan
2015-01-03            4  201501      en    -1  Michał_Pazdan
2015-01-04           10  201501      en    -1  Michał_Pazdan
2015-01-05            3  201501      en    -1  Michał_Pazdan
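
The long numeric keys in the raw JSON are Unix timestamps in milliseconds, which is why the loaded frame ends up with dates in its index. A minimal sketch of that conversion using `pd.to_datetime` directly (the two keys and values are copied from the dump above):

```python
import pandas as pd

# Two keys from the "daily_views" object, in milliseconds since 1970-01-01.
epoch_ms = [1420070400000, 1420156800000]

# unit="ms" tells pandas how to interpret the integers.
dates = pd.to_datetime(epoch_ms, unit="ms")
print(dates)

# Attach the matching view counts to get a small date-indexed Series.
daily_views = pd.Series([2, 10], index=dates)
print(daily_views)
```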


### EXERCISE

Load the movies from data/movie_metadata.csv and
check which dimensions and columns the table has.

In [50]:
#E: movies = 

movies = pd.read_csv('data/movie_metadata.csv')
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [51]:
assert movies.shape == (5043, 28)

In [52]:
movies.columns

Index([u'color', u'director_name', u'num_critic_for_reviews', u'duration',
       u'director_facebook_likes', u'actor_3_facebook_likes', u'actor_2_name',
       u'actor_1_facebook_likes', u'gross', u'genres', u'actor_1_name',
       u'movie_title', u'num_voted_users', u'cast_total_facebook_likes',
       u'actor_3_name', u'facenumber_in_poster', u'plot_keywords',
       u'movie_imdb_link', u'num_user_for_reviews', u'language', u'country',
       u'content_rating', u'budget', u'title_year', u'actor_2_facebook_likes',
       u'imdb_score', u'aspect_ratio', u'movie_facebook_likes'],
      dtype='object')
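
Besides `shape` and `columns`, a few more attributes help when sizing up an unfamiliar table. The sketch below uses a hand-typed sample of three rows and four columns from the movie data shown above rather than the real file, so the numbers it prints describe only that sample.

```python
import pandas as pd

# Hand-typed sample; the real data comes from data/movie_metadata.csv.
sample = pd.DataFrame({
    "movie_title": ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"],
    "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
    "duration": [178.0, 169.0, 148.0],
    "imdb_score": [7.9, 7.1, 6.8],
})

# (number of rows, number of columns)
print(sample.shape)

# Column names as a plain Python list.
print(list(sample.columns))

# Data type of every column.
print(sample.dtypes)

# Summary statistics for the numeric columns only.
summary = sample.describe()
print(summary)
```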

## Filtering
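
This section is still a placeholder. As a rough sketch of what filtering with three conditions could look like, the example below applies boolean masks to a hand-typed sample of four rows from the movie table above; the thresholds are arbitrary.

```python
import pandas as pd

# Hand-typed sample of the first four rows of the movie data.
sample = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes", "Christopher Nolan"],
    "duration": [178.0, 169.0, 148.0, 164.0],
    "country": ["USA", "USA", "UK", "USA"],
    "imdb_score": [7.9, 7.1, 6.8, 8.5],
})

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds tighter than == and >.
long_good_us_movies = sample[
    (sample["country"] == "USA")
    & (sample["duration"] > 165)
    & (sample["imdb_score"] >= 7.5)
]
print(long_good_us_movies)
```

Only the first row passes all three conditions: the second fails on score, the third on country, and the fourth on duration.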

### TODO Skills to include

- [X] know Series - create Series with ... (maybe something related to the workshop audience)
- [X] know DataFrame - create DataFrame in 2 different ways ...
- [X] read files - read file X (maybe csv)
- [ ] filtering of data - filter file in memory with 3 conditions
- [ ] saving data - convert data to list of dicts
- [ ] save data as csv
- [ ] maybe save with to_sql with SQLite
- [ ] calculating new columns, transforming data using map/apply/applymap - create new columns X, Y, Z
- [ ] groupby, aggregation functions, pipeline API - group by X, Y and aggregate by median/last
- [ ] understanding indexes
- [ ] datetime operations - stock market / euro state / beer consumption data?
- [ ] creating simple algorithms using pandas - alerts? hypes? stock market KPIs?
- [ ] delivering working product - wrapping up pandas inside Flask
- [ ] data visualisation - plotting stuff