# Pandas Review

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

You can find it here: http://pandas.pydata.org/

And the documentation can be found here: http://pandas.pydata.org/pandas-docs/stable/

In this notebook we review some of its functionality.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("../data/titanic-train.csv")

## Quick exploration

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## New in 0.19: category parsing

In [None]:
df = pd.read_csv("../data/titanic-train.csv",
                 dtype={'Pclass': 'category',
                        'Sex': 'category',
                        'Embarked': 'category'}
                )

In [None]:
df.head()

In [None]:
df.info()
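
As a minimal sketch of what the `dtype` argument does, the cell below parses a small in-memory CSV instead of the Titanic file. The `csv_text` sample and the `small` variable are made up for illustration. Note that categories parsed this way are stored as strings, even for a numeric-looking column such as `Pclass`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the Titanic file
csv_text = """Pclass,Sex,Embarked,Age
3,male,S,22
1,female,C,38
3,female,S,26
1,female,S,35
"""

small = pd.read_csv(io.StringIO(csv_text),
                    dtype={'Pclass': 'category',
                           'Sex': 'category',
                           'Embarked': 'category'})

# The three listed columns are categorical; Age keeps its inferred numeric dtype
print(small.dtypes)

# Each categorical column stores its distinct values once, as categories
print(small['Sex'].cat.categories)
```

On columns with few distinct values, the category dtype usually reduces memory use; `small.memory_usage(deep=True)` is one way to check this on your own data.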

## Indexing

Try to figure out what each of the following indexing methods does.

If you get stuck, check here: http://pandas.pydata.org/pandas-docs/stable/indexing.html

In [None]:
df.loc[0]  # .ix was removed in pandas 1.0; .loc selects by label

In [None]:
df.iloc[3]

In [None]:
df.loc[0:4,'Ticket']

In [None]:
df['Ticket'].head()
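
The difference between label-based and position-based indexing is easier to see when the index does not start at 0. The sketch below uses a toy frame (the `toy` data is invented for illustration) to contrast `.loc` and `.iloc` slicing.

```python
import pandas as pd

# Toy frame with a non-default integer index (illustrative data only)
toy = pd.DataFrame({'Ticket': ['A1', 'B2', 'C3', 'D4']},
                   index=[10, 20, 30, 40])

# .loc selects by label, and label slices include both endpoints
by_label = toy.loc[10:30, 'Ticket']

# .iloc selects by integer position, and slices exclude the end position
by_position = toy.iloc[0:2, 0]

print(by_label.tolist())      # ['A1', 'B2', 'C3']
print(by_position.tolist())   # ['A1', 'B2']
```

With the default `RangeIndex` used by the Titanic frame, labels and positions coincide, which is why `df.loc[0:4]` returns five rows while `df.iloc[0:4]` returns four.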

## Selections

Try to figure out what each of the following selections does.

If you get stuck, check here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking

In [None]:
df[df.Age > 70]

In [None]:
age = df['Age']
age.where(age > 30).head()

In [None]:
df[(df['Age'] == 11) & (df['SibSp'] == 5)]

In [None]:
df[(df.Age == 11) | (df.SibSp == 5)]

In [None]:
df.query('(Age == 11) & (SibSp == 5)')
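
To compare these selection styles side by side, here is a small sketch on invented data (the `people` frame is hypothetical). It shows that a boolean mask drops non-matching rows, while `where` keeps the original shape and fills non-matching entries with `NaN`.

```python
import pandas as pd

# Illustrative data only
people = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 54.0],
                       'SibSp': [1, 1, 0, 0]})
age = people['Age']

# Boolean mask: only the rows where the condition is True are returned
masked = age[age > 30]

# where: same length as the input, NaN where the condition is False
same_shape = age.where(age > 30)

# Combined conditions need parentheses because & binds tighter than > and ==
both = people[(people['Age'] > 30) & (people['SibSp'] == 1)]

# The same filter written as a query string
via_query = people.query('Age > 30 and SibSp == 1')

print(masked.tolist())                # [38.0, 54.0]
print(int(same_shape.isna().sum()))   # 2
print(both.equals(via_query))         # True
```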

## Distinct elements

In [None]:
df['Embarked'].unique()
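
Missing values affect how distinct elements are reported. The sketch below, on a made-up `ports` series, shows that `unique()` includes `NaN` while `nunique()` excludes it by default.

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing value
ports = pd.Series(['S', 'C', 'S', np.nan, 'Q'], name='Embarked')

distinct = ports.unique()                   # distinct values, NaN included
n_distinct = ports.nunique()                # NaN is not counted by default
counts = ports.value_counts(dropna=False)   # frequency of each value, NaN included

print(len(distinct))   # 4
print(n_distinct)      # 3
print(counts['S'])     # 2
```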

## Group by

Pandas supports many SQL-like operations, such as group by, order by and join. In pandas they are called:
- groupby
- sort_values
- merge

See some examples below and refer to:

http://pandas.pydata.org/pandas-docs/stable/groupby.html

http://pandas.pydata.org/pandas-docs/stable/merging.html

In [None]:
# Find average age of passengers that survived vs. died
df.groupby('Survived')['Age'].mean()

In [None]:
df.sort_values('Age', ascending=False).head()

In [None]:
pd.merge(df[['PassengerId', 'Survived']],
         df[['PassengerId', 'Age']],
         on='PassengerId').head()
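
The three operations map onto their SQL counterparts roughly as sketched below. The `passengers` and `fares` tables are invented for illustration; `how='left'` is used here to show that a merge can keep rows with no match in the second table.

```python
import pandas as pd

# Illustrative tables only
passengers = pd.DataFrame({'PassengerId': [1, 2, 3, 4],
                           'Survived': [0, 1, 1, 0],
                           'Age': [22.0, 38.0, 26.0, 35.0]})
fares = pd.DataFrame({'PassengerId': [1, 2, 3],
                      'Fare': [7.25, 71.28, 7.92]})

# GROUP BY Survived, then average Age within each group
mean_age = passengers.groupby('Survived')['Age'].mean()

# ORDER BY Age DESC
oldest_first = passengers.sort_values('Age', ascending=False)

# LEFT JOIN on PassengerId: passenger 4 has no fare, so Fare is NaN there
joined = pd.merge(passengers, fares, on='PassengerId', how='left')

print(mean_age.to_dict())                    # {0: 28.5, 1: 32.0}
print(oldest_first['PassengerId'].tolist())  # [2, 4, 3, 1]
print(int(joined['Fare'].isna().sum()))      # 1
```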

## Pivot Tables

Pandas also supports Excel-like functionality, such as pivot tables.

See: http://pandas.pydata.org/pandas-docs/stable/reshaping.html

In [None]:
df.pivot_table(index='Pclass', columns='Survived', values='PassengerId', aggfunc='count')

In [None]:
df['Pclass'].value_counts()
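
A count-style pivot table is a cross-tabulation, so `pd.crosstab` is an alternative worth knowing. The sketch below uses a made-up `toy` frame to show that both give the same counts when every class and outcome combination occurs.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({'PassengerId': [1, 2, 3, 4, 5],
                    'Pclass': [1, 1, 3, 3, 3],
                    'Survived': [1, 0, 0, 0, 1]})

# Count passengers for each (Pclass, Survived) combination
pivot = toy.pivot_table(index='Pclass', columns='Survived',
                        values='PassengerId', aggfunc='count')

# crosstab computes the same frequency table directly from two columns
cross = pd.crosstab(toy['Pclass'], toy['Survived'])

print(pivot.loc[3, 0])   # 2
print(cross.loc[3, 0])   # 2
```

When some combination is absent from the data, `pivot_table` leaves a `NaN` in that cell unless `fill_value=0` is passed, whereas `crosstab` reports 0.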

There's much more that Pandas can do for you. Make sure to check the documentation: http://pandas.pydata.org/pandas-docs/stable/

## Exercises:

- select passengers that survived
- select passengers that embarked in port S
- select male passengers
- select passengers who paid a fare of less than 40 and were in third class
- locate the name of the passenger with PassengerId 674
- calculate the average age of passengers using the function mean()
- count the number of passengers who survived and the number who died
- count the number of males and females
- count the number of passengers who survived and who died for each gender
- calculate the average fare paid by passengers who survived and by those who died


*Copyright &copy; 2015 Dataweekends.  All rights reserved.*