# Data Analysis with Pandas

In this section we review some of the most commonly used functions for data analysis with pandas.

In [None]:
import numpy as np
import pandas as pd

In [None]:
titanic = pd.read_csv('../datasets/titanic/titanic.csv')

In [None]:
titanic.info()

## Missing Values

In [None]:
# Counting missing values
# By column
titanic.isnull().sum(axis=0)

In [None]:
# Dropping rows
titanic.drop(1309, axis=0, inplace=True)
titanic.tail(3)

In [None]:
# Dropping columns
titanic.drop(['ticket','boat','body'], axis=1, inplace=True)
titanic.head(3)

In [None]:
titanic['embarked'].value_counts()

In [None]:
# Most common value for this column
most_popular_embarked = titanic['embarked'].value_counts().idxmax()
most_popular_embarked

In [None]:
# fill the na's with the most popular value for embarked
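
One possible way to do this is sketched below. It uses a small made-up DataFrame instead of the Titanic data so that it runs on its own; only the column name `embarked` is taken from the notebook, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up)
toy = pd.DataFrame({'embarked': ['S', 'C', 'S', np.nan, 'Q', 'S', np.nan]})

# Most frequent port, computed the same way as in the previous cell
most_popular_embarked = toy['embarked'].value_counts().idxmax()

# fillna returns a new Series, so assign it back to the column
toy['embarked'] = toy['embarked'].fillna(most_popular_embarked)
print(toy['embarked'].value_counts())
```

The same two lines should carry over to `titanic['embarked']` unchanged.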


In [None]:
# Flag the rows where age is missing before filling it in, for later use
titanic['missing_age'] = titanic['age'].isnull()

In [None]:
# Fill the na's from age and fare with the median values
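
A minimal sketch of one way to approach this, again on a made-up DataFrame rather than the real data. The column names `age`, `fare` and `missing_age` follow the notebook; the values are assumptions chosen to keep the medians easy to check.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up)
toy = pd.DataFrame({'age': [22.0, np.nan, 30.0, 40.0],
                    'fare': [7.25, 71.0, np.nan, 8.05]})

# Record which ages were missing before they are overwritten
toy['missing_age'] = toy['age'].isnull()

# median() skips NaN by default; the result is a Series indexed by column name
medians = toy[['age', 'fare']].median()

# Passing that Series to fillna fills each column with its own median
toy[['age', 'fare']] = toy[['age', 'fare']].fillna(medians)
print(toy)
```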

In [None]:
# What was the mean fare paid in each class?


## Vectorized string methods

Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:

In [None]:
titanic['name'].str.startswith('A')

In [None]:
titanic[titanic['name'].str.startswith('Al')]

In [None]:
# na=False treats missing destinations as non-matches, so the mask is purely boolean
titanic[titanic['home.dest'].str.contains('NY', na=False)]

In [None]:
titanic['name'].str.lower()

In [None]:
titanic['name'].str.len()
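
When a string method is used to build a filter, missing values need some care. The sketch below uses a made-up Series of names (not the Titanic data) and the `na` parameter of `str.startswith` to decide explicitly how missing entries are treated.

```python
import numpy as np
import pandas as pd

# Made-up names with one missing entry
names = pd.Series(['Allen, Miss. Elisabeth', np.nan, 'Brown, Mr. Tom'])

# na=False makes the missing name count as a non-match
mask = names.str.startswith('Al', na=False)

# The mask is purely boolean, so it can be used directly for selection
selected = names[mask]
print(selected)
```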

## Split-Apply-Combine

<b>Step 1 (Split):</b> The <i>groupby</i> operation <b><i>splits</i></b> the dataframe into a set of smaller dataframes based on some criterion. Note that the grouped object is <i>not</i> a dataframe. It has a dictionary-like structure and is also iterable.

<img src="img/groupby1.jpg">

<b>Step 2 (Apply):</b> Once we have a grouped object we can <b><i>apply</i></b> functions or run an analysis on each group, on a subset of the groups, or on all of them.

<img src="img/groupby2.jpg">

<b>Step 3 (Combine):</b> We can also <b><i>combine</i></b> the results of the analysis into one or more new data structures.

<img src="img/groupby3.jpg">
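
The three steps can be spelled out by hand to see what a groupby aggregation does. This is only an illustrative sketch on a made-up DataFrame; the variable names `pieces`, `means` and `combined` are introduced here and are not part of pandas.

```python
import pandas as pd

# Made-up data with the same column names as the Titanic set
toy = pd.DataFrame({'sex': ['female', 'male', 'male', 'female'],
                    'fare': [80.0, 10.0, 20.0, 60.0]})

grouped = toy.groupby('sex')

# Split: iterating over the grouped object yields (key, sub-dataframe) pairs
pieces = {key: group for key, group in grouped}

# Apply: compute a statistic on each piece separately
means = {key: group['fare'].mean() for key, group in pieces.items()}

# Combine: gather the per-group results into a single Series
combined = pd.Series(means, name='fare')

# The usual one-liner performs all three steps at once
shortcut = grouped['fare'].mean()
print(combined)
print(shortcut)
```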

#### Gender differences

In [None]:
titanic.groupby('pclass')['fare'].median()

In [None]:
# numeric_only=True skips text columns such as name, which cannot be averaged
titanic.groupby('sex').mean(numeric_only=True)

In [None]:
titanic.groupby('sex').size()

In [None]:
# Calculate the oldest and youngest persons by gender
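
One way to approach this is to aggregate `age` with several functions at once. The sketch below runs on a made-up DataFrame; only the column names `sex` and `age` come from the notebook.

```python
import pandas as pd

# Made-up ages, not taken from the real data
toy = pd.DataFrame({'sex': ['female', 'male', 'female', 'male', 'female'],
                    'age': [29.0, 0.92, 2.0, 30.0, 63.0]})

# agg with a list of function names returns one column per function
age_range = toy.groupby('sex')['age'].agg(['min', 'max'])
print(age_range)
```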


#### Passenger class differences

In [None]:
# Mean fare by passenger class (grouping)


#### Passenger class and gender differences

In [None]:
# numeric_only=True skips text columns such as name, which cannot be averaged
by_class_gender = titanic.groupby(['pclass','sex']).mean(numeric_only=True)
by_class_gender

In [None]:
# Group with the highest average age ('oldest group')


# Group with the lowest average age ('youngest group')
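
A possible approach, sketched on made-up data: `idxmax` and `idxmin` return the index label of the largest and smallest value, which for a two-key groupby is a `(pclass, sex)` tuple.

```python
import pandas as pd

# Made-up passengers; column names follow the notebook
toy = pd.DataFrame({
    'pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'sex': ['female', 'male', 'male', 'female', 'male', 'female', 'male', 'female'],
    'age': [35, 40, 50, 28, 31, 20, 26, 24],
})

# Mean age per (pclass, sex) group
mean_age = toy.groupby(['pclass', 'sex'])['age'].mean()

oldest_group = mean_age.idxmax()    # label of the group with the highest mean age
youngest_group = mean_age.idxmin()  # label of the group with the lowest mean age
print(oldest_group, youngest_group)
```

With the notebook's own table, the same calls would be applied to `by_class_gender['age']`.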


In [None]:
# Group with the highest mortality rate


# Group with the lowest mortality rate
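
The sketch below assumes the data has a `survived` column coded as 1 (survived) and 0 (died); that column is not shown elsewhere in this notebook, so treat its name and coding as an assumption. Under that assumption the mortality rate of a group is one minus the mean of `survived`.

```python
import pandas as pd

# Made-up passengers; 'survived' coded 1/0 is an assumption
toy = pd.DataFrame({
    'pclass': [1, 1, 1, 1, 3, 3, 3, 3],
    'sex': ['female', 'female', 'male', 'male', 'female', 'female', 'male', 'male'],
    'survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate, so 1 - mean is the mortality rate
mortality = 1 - toy.groupby(['pclass', 'sex'])['survived'].mean()

highest_mortality = mortality.idxmax()
lowest_mortality = mortality.idxmin()
print(mortality)
print(highest_mortality, lowest_mortality)
```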


## Other useful operations

### Transforming variable types

In [None]:
# pclass from float to int
titanic['pclass'] = titanic['pclass'].astype(int)

In [None]:
# pclass from int to category
titanic['pclass'] = titanic['pclass'].astype('category')

### Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

In [None]:
titanic['age_decade'] = pd.cut(titanic['age'], bins=[0,10,20,30,40,50,60,70,80])
titanic['age_decade'].head(10)

In [None]:
titanic['fare_quartile'] = pd.qcut(titanic['fare'], q=[0, .25, .5, .75, 1])
titanic['fare_quartile'].head()
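
The difference between the two functions is easiest to see on skewed data. In this sketch (made-up numbers) `cut` produces bins of equal width with uneven counts, while `qcut` picks the edges so that each bin holds roughly the same number of values.

```python
import pandas as pd

# Made-up, deliberately skewed values
values = pd.Series([1, 2, 3, 4, 50, 60, 70, 100])

# cut: the bin edges are given explicitly, so the counts per bin can differ
by_value = pd.cut(values, bins=[0, 25, 50, 75, 100])
value_counts = by_value.value_counts().sort_index()

# qcut: the edges are chosen from the sample quantiles
by_quantile = pd.qcut(values, q=4)
quantile_counts = by_quantile.value_counts().sort_index()

print(value_counts)
print(quantile_counts)
```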

### Row or Column-wise Function Application

In [None]:
def stat_range(series):
    return series.max() - series.min()

In [None]:
titanic[['age','fare']].apply(stat_range)
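
By default `apply` passes each column to the function. With `axis=1` it passes each row instead. The sketch below reuses `stat_range` on a made-up DataFrame to show both directions; the values are invented.

```python
import pandas as pd

def stat_range(series):
    return series.max() - series.min()

# Made-up counts; column names follow the notebook
toy = pd.DataFrame({'sibsp': [1, 0, 3], 'parch': [0, 2, 1]})

# Column-wise (axis=0, the default): one result per column
per_column = toy.apply(stat_range)

# Row-wise (axis=1): one result per row
per_row = toy.apply(stat_range, axis=1)

print(per_column)
print(per_row)
```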

### Renaming columns

In [None]:
titanic.columns

In [None]:
# rename returns a new DataFrame; titanic itself is unchanged unless the result is assigned
titanic.rename(columns={'sibsp':'siblings_spouse','parch':'parents_children'})

### isin
Use this method if you want to know if the values in a Series belong to a list of elements.

In [None]:
titanic['embarked'].isin(['S','C']).head()
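
The boolean Series returned by `isin` is typically used as a filter, and `~` inverts it. A short sketch on made-up values (missing entries never match):

```python
import numpy as np
import pandas as pd

# Made-up ports with one missing value
embarked = pd.Series(['S', 'C', 'Q', 'S', np.nan])

# True where the value is one of the listed ports
mask = embarked.isin(['S', 'C'])

in_list = embarked[mask]       # rows whose port is S or C
not_in_list = embarked[~mask]  # everything else, including the missing value

print(in_list)
print(not_in_list)
```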