---

<img src="./images/anchormen-logo.png" width="500">

---

# Basic Pandas Operations

This notebook outlines some basic operations in pandas. It covers the following topics:

- A. **Creating DataFrames**
- B. **Selecting and Sorting**
- C. **Modifying Data**
- D. **Grouped Operations**
- E. **Long/Wide Conversions**
- F. **Combining DataFrames**

## Pandas intro

`pandas` is one of the most widely used Python libraries for data manipulation and analysis. It is built on top of `NumPy`, which by itself is a fairly low-level array library whose usage resembles MATLAB. `pandas`, on the other hand, provides rich time series functionality, data alignment, NA-friendly statistics, groupby, merge and join methods, and many other conveniences.

`pandas` was created by Wes McKinney, also the author of 'Python for Data Analysis' (see references).


### Data Structures

The most important data structures in `pandas` are:

- `DataFrame`: two-dimensional labeled table (rows and columns) in which each column is a pandas `Series`
- `Series`: one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, ...)

The focus of this notebook will be on working with `DataFrames`.
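
To make the relationship between the two structures concrete, here is a minimal sketch. The names `ages`, `countries` and `people` and their values are illustrative assumptions, not part of the notebook's own cells.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels (illustrative data)
ages = pd.Series([25, 29, 41], index=['Sara', 'Laura', 'Glenn'], name='Age')
countries = pd.Series(['Belgium', 'Australia', 'United States'],
                      index=['Sara', 'Laura', 'Glenn'], name='Country')

# A DataFrame built from two Series that share the same index
people = pd.DataFrame({'Age': ages, 'Country': countries})

print(people.shape)          # (number of rows, number of columns)
print(type(people['Age']))   # selecting one column gives back a Series
```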

## A: Creating DataFrames

Let's start by importing some libraries. Our focus will be on the `pandas` library.

In [1]:
import pandas as pd
import numpy as np

### Creating DataFrames

DataFrames can be constructed in many ways, for example from a dictionary of equal-length lists, from a NumPy array, or from an Excel or CSV file.
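
As a rough illustration of a few of these construction routes, the sketch below builds the same small table three ways. The column names and the in-memory CSV text are assumptions made up for this example.

```python
import io
import pandas as pd

# From a dictionary of equal-length lists: keys become column names
from_dict = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1, 2, 3]})

# From a list of dictionaries: each dictionary becomes one row
from_records = pd.DataFrame([
    {'name': 'a', 'value': 1},
    {'name': 'b', 'value': 2},
    {'name': 'c', 'value': 3},
])

# From CSV text; a file path works the same way as this in-memory buffer
csv_text = "name,value\na,1\nb,2\nc,3\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict.equals(from_records))
print(from_csv)
```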

In [2]:
# One way to create a pandas DataFrame: from a NumPy array of random numbers, with explicit column and row labels
df = pd.DataFrame(np.random.randn(4,4), 
                  columns=['var1', 'var2', 'var3', 'var4'], 
                  index = ['observation1', 'observation2', 'observation3', 'observation4'])

In [3]:
type(df)

pandas.core.frame.DataFrame

In [4]:
df

Unnamed: 0,var1,var2,var3,var4
observation1,-0.220395,1.146229,0.443293,-1.224787
observation2,-0.369083,0.686999,-1.071056,-0.86319
observation3,-1.257075,-2.091312,-0.65112,-0.532363
observation4,-0.674821,0.063806,0.334175,-0.170289


### Indexing

In [5]:
# It's possible to set the index to one of the columns; inplace=True modifies df directly and replaces the observation labels
df.set_index('var1', inplace=True)
df

var1,var2,var3,var4
-0.220395,1.146229,0.443293,-1.224787
-0.369083,0.686999,-1.071056,-0.86319
-1.257075,-2.091312,-0.65112,-0.532363
-0.674821,0.063806,0.334175,-0.170289


In [6]:
# Reset the index: var1 becomes a regular column again and the rows get a default integer index
df.reset_index()

Unnamed: 0,var1,var2,var3,var4
0,-0.220395,1.146229,0.443293,-1.224787
1,-0.369083,0.686999,-1.071056,-0.86319
2,-1.257075,-2.091312,-0.65112,-0.532363
3,-0.674821,0.063806,0.334175,-0.170289
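
The cells above mix an in-place call with one that returns a new object. The following sketch, using a small made-up `frame`, shows what each variant returns; `drop=True` is a standard `reset_index` option that the notebook itself does not use.

```python
import pandas as pd

frame = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [10, 20, 30]})

# Without inplace=True, set_index returns a new DataFrame; frame is unchanged
indexed = frame.set_index('key')

# reset_index moves the index back into a regular column ...
restored = indexed.reset_index()

# ... unless drop=True, which discards the old index instead
dropped = indexed.reset_index(drop=True)

print(list(frame.columns))     # ['key', 'value']
print(list(indexed.columns))   # ['value']
print(list(restored.columns))  # ['key', 'value']
print(list(dropped.columns))   # ['value']
```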


In [7]:
# Create indexed dataframe again before we continue
df = pd.DataFrame(np.random.randn(4,4), 
                  columns=['var1', 'var2', 'var3', 'var4'], 
                  index = ['observation1', 'observation2', 'observation3', 'observation4'])

### Summary Part A (Creating DataFrames)

**We learned how to**:

- import the pandas library
- create a DataFrame
- set and reset an index on the DataFrame

**And encountered the following pandas data type**:

- pandas.core.frame.DataFrame

**Making use of the following commands and methods**:
```
- import pandas as pd
- pd.DataFrame()
- df.set_index()
- df.reset_index()
```

## B: Selecting and Sorting

### Column Selection

In [8]:
# Select the first column 
df['var1']

observation1   -1.033894
observation2   -0.166223
observation3    0.788574
observation4   -1.210072
Name: var1, dtype: float64

In [9]:
# Note that DataFrame columns are of type Series
type(df['var1'])

pandas.core.series.Series

In [10]:
# Select multiple columns, pass in a list with column names
df[['var1', 'var3']]

Unnamed: 0,var1,var3
observation1,-1.033894,-1.421882
observation2,-0.166223,0.563013
observation3,0.788574,0.543387
observation4,-1.210072,-0.232154


### Row Selection

In [11]:
# Select a row based on index label
df.loc['observation4']

var1   -1.210072
var2   -0.732834
var3   -0.232154
var4   -0.553107
Name: observation4, dtype: float64

In [12]:
# Select a row based on index position
df.iloc[2]

var1    0.788574
var2   -1.009404
var3    0.543387
var4   -0.063707
Name: observation3, dtype: float64
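
One difference between `.loc` and `.iloc` that is easy to miss shows up when slicing. The sketch below uses a made-up `frame`; label slices include both endpoints, while position slices exclude the stop, as Python lists do.

```python
import pandas as pd

frame = pd.DataFrame({'var1': [1, 2, 3, 4]},
                     index=['obs1', 'obs2', 'obs3', 'obs4'])

# Label-based slice: both 'obs2' and 'obs3' are included
by_label = frame.loc['obs2':'obs3']

# Position-based slice: position 1 is included, position 3 is not
by_position = frame.iloc[1:3]

print(by_label.index.tolist())          # ['obs2', 'obs3']
print(by_position.index.tolist())       # ['obs2', 'obs3']
print(frame.iloc[1:2].index.tolist())   # ['obs2']
```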

### Select combination of rows and columns

In [13]:
# Select a single value by row label and column label
df.loc['observation4', 'var2']

-0.7328341829307481

In [14]:
# Select multiple specific rows and columns 
df.loc[['observation3', 'observation4'], ['var1', 'var2','var3']]

Unnamed: 0,var1,var2,var3
observation3,0.788574,-1.009404,0.543387
observation4,-1.210072,-0.732834,-0.232154


### Conditional Selection 

In [15]:
# Complete dataframe
df

Unnamed: 0,var1,var2,var3,var4
observation1,-1.033894,2.436842,-1.421882,-1.416353
observation2,-0.166223,-0.09211,0.563013,0.908918
observation3,0.788574,-1.009404,0.543387,-0.063707
observation4,-1.210072,-0.732834,-0.232154,-0.553107


In [16]:
# Condition
df['var1']>-1

observation1    False
observation2     True
observation3     True
observation4    False
Name: var1, dtype: bool

In [17]:
# Select rows with condition True
df[df['var1']>-1]

Unnamed: 0,var1,var2,var3,var4
observation2,-0.166223,-0.09211,0.563013,0.908918
observation3,0.788574,-1.009404,0.543387,-0.063707


In [18]:
# Logical conditions: combine with & (and) or | (or), and wrap each condition in parentheses
df[(df['var1']<2) & (df['var4']>=0)]

Unnamed: 0,var1,var2,var3,var4
observation2,-0.166223,-0.09211,0.563013,0.908918
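
Besides `&`, boolean masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses a made-up `frame`; `DataFrame.query` is shown as an alternative spelling of the `&` example, not as something the notebook relies on.

```python
import pandas as pd

frame = pd.DataFrame({'var1': [-1.5, -0.2, 0.8, -1.2],
                      'var4': [-1.4, 0.9, -0.1, -0.6]})

# OR: rows where at least one condition holds
either = frame[(frame['var1'] > 0) | (frame['var4'] > 0)]

# NOT: rows where the condition is False
negated = frame[~(frame['var1'] > -1)]

# query() expresses the same kind of filter as a string
queried = frame.query('var1 < 2 and var4 >= 0')

print(either.index.tolist())    # [1, 2]
print(negated.index.tolist())   # [0, 3]
print(queried.index.tolist())   # [1]
```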


### Sorting

In [19]:
df

Unnamed: 0,var1,var2,var3,var4
observation1,-1.033894,2.436842,-1.421882,-1.416353
observation2,-0.166223,-0.09211,0.563013,0.908918
observation3,0.788574,-1.009404,0.543387,-0.063707
observation4,-1.210072,-0.732834,-0.232154,-0.553107


In [20]:
df.sort_values(by='var1')

Unnamed: 0,var1,var2,var3,var4
observation4,-1.210072,-0.732834,-0.232154,-0.553107
observation1,-1.033894,2.436842,-1.421882,-1.416353
observation2,-0.166223,-0.09211,0.563013,0.908918
observation3,0.788574,-1.009404,0.543387,-0.063707
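
`sort_values` sorts in ascending order by default. A short sketch of the other common variants follows, with a made-up `frame`: descending order, sorting by several columns, and restoring the original order with `sort_index`.

```python
import pandas as pd

frame = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                      'score': [1, 3, 4, 2]})

# Descending order on a single column
descending = frame.sort_values(by='score', ascending=False)

# Several columns: group ascending, then score descending within each group
multi = frame.sort_values(by=['group', 'score'], ascending=[True, False])

# sort_index sorts by the row labels, which restores the original order here
by_index = descending.sort_index()

print(descending.index.tolist())   # [2, 1, 3, 0]
print(multi.index.tolist())        # [1, 3, 2, 0]
print(by_index.index.tolist())     # [0, 1, 2, 3]
```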


### Unique Values

In [21]:
# Create new dataframe
df2 = pd.DataFrame(np.random.randint(5, size=[4,4]),
                   columns=['var1', 'var2', 'var3', 'var4'], 
                   index = ['observation1', 'observation2', 'observation3', 'observation4'])

In [22]:
df2

Unnamed: 0,var1,var2,var3,var4
observation1,3,0,2,4
observation2,3,0,0,1
observation3,3,1,3,2
observation4,4,0,4,1


In [23]:
# Unique values 
df2['var4'].unique()

array([4, 1, 2])

In [24]:
# Number of unique values 
df2['var4'].nunique()

3

In [25]:
# For each unique value, count the number of occurrences
df2['var4'].value_counts()

1    2
2    1
4    1
Name: var4, dtype: int64

### Summary Part B (Selecting and Sorting)

**We learned**:

- how to select columns, rows and combinations thereof within a pandas DataFrame
- how to sort by specific columns
- how to extract unique values and value counts of columns
- that columns in a pandas DataFrame are pandas Series objects

**And encountered the following pandas data type**:

- pandas.core.series.Series

**Making use of the following commands and methods**:
```
- df[['var1', 'var3']]
- df.loc[['observation3', 'observation4'], ['var1', 'var2','var3']]
- df.iloc[2]
- df[(df['var1']<2) & (df['var4']>=0)]
- df.sort_values(by='var1')
- df2['var4'].unique()
- df2['var4'].nunique()
- df2['var4'].value_counts()
```

## C: Modifying Data

### Assigning values

In [26]:
df.loc[['observation2','observation4'],'var3'] = 1
df

Unnamed: 0,var1,var2,var3,var4
observation1,-1.033894,2.436842,-1.421882,-1.416353
observation2,-0.166223,-0.09211,1.0,0.908918
observation3,0.788574,-1.009404,0.543387,-0.063707
observation4,-1.210072,-0.732834,1.0,-0.553107


### Changing Column Data Types

In [27]:
# Check the datatype of each column 
df.dtypes

var1    float64
var2    float64
var3    float64
var4    float64
dtype: object

In [28]:
# Change datatypes of certain columns (Series) to integers; note that astype('int') truncates toward zero rather than rounding
df[['var3', 'var4']] = df[['var3', 'var4']].astype('int')
# Change datatypes of certain columns to string:
df['var2'] = df['var2'].astype('str')
df.dtypes

var1    float64
var2     object
var3      int64
var4      int64
dtype: object

In [29]:
df

Unnamed: 0,var1,var2,var3,var4
observation1,-1.033894,2.436841774030365,-1,-1
observation2,-0.166223,-0.0921095674906699,1,0
observation3,0.788574,-1.0094044003328873,0,0
observation4,-1.210072,-0.7328341829307481,1,0
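
The output above shows that `astype('int')` cuts off the decimals. The sketch below contrasts that with rounding first, and converts strings back to numbers with `pd.to_numeric`. The Series `values` is made up for the example.

```python
import pandas as pd

values = pd.Series([-1.42, 0.54, 0.91])

# astype('int') truncates toward zero
truncated = values.astype('int')

# Rounding first gives the nearest integer instead
rounded = values.round().astype('int')

# Numbers stored as strings can be converted back with pd.to_numeric
as_text = values.astype('str')
back = pd.to_numeric(as_text)

print(truncated.tolist())   # [-1, 0, 0]
print(rounded.tolist())     # [-1, 1, 1]
print(back.dtype)           # float64
```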


### Creating new columns

In [30]:
df['var5'] = df['var1'] + df['var3']
df['var6'] = 'test'
df

Unnamed: 0,var1,var2,var3,var4,var5,var6
observation1,-1.033894,2.436841774030365,-1,-1,-2.033894,test
observation2,-0.166223,-0.0921095674906699,1,0,0.833777,test
observation3,0.788574,-1.0094044003328873,0,0,0.788574,test
observation4,-1.210072,-0.7328341829307481,1,0,-0.210072,test


### Removing columns 

In [31]:
df.drop('var2', axis=1)

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1,-1,-2.033894,test
observation2,-0.166223,1,0,0.833777,test
observation3,0.788574,0,0,0.788574,test
observation4,-1.210072,1,0,-0.210072,test


In [32]:
# Note that the dataframe itself will not change unless you use the inplace=True parameter
# The var2 column is still in the original dataframe!
df

Unnamed: 0,var1,var2,var3,var4,var5,var6
observation1,-1.033894,2.436841774030365,-1,-1,-2.033894,test
observation2,-0.166223,-0.0921095674906699,1,0,0.833777,test
observation3,0.788574,-1.0094044003328873,0,0,0.788574,test
observation4,-1.210072,-0.7328341829307481,1,0,-0.210072,test


In [33]:
# This time dropping with inplace=True
df.drop('var2', axis=1, inplace=True)
df

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1,-1,-2.033894,test
observation2,-0.166223,1,0,0.833777,test
observation3,0.788574,0,0,0.788574,test
observation4,-1.210072,1,0,-0.210072,test
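
An alternative to `inplace=True` is to assign the result of `drop` to a name. The sketch below also uses the `columns=` and `index=` keywords, which are equivalent to passing `axis=1` and `axis=0`; the `frame` data is made up.

```python
import pandas as pd

frame = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4], 'var3': [5, 6]})

# Same effect as frame.drop('var2', axis=1), with the result kept under a new name
trimmed = frame.drop(columns=['var2'])

# Same effect as frame.drop(0, axis=0)
fewer_rows = frame.drop(index=[0])

print(list(trimmed.columns))       # ['var1', 'var3']
print(fewer_rows.index.tolist())   # [1]
print(list(frame.columns))         # the original is untouched
```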


In [34]:
# Drop rows in a similar way (axis=0); without inplace=True, df itself is unchanged
df.drop('observation4', axis=0)

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1,-1,-2.033894,test
observation2,-0.166223,1,0,0.833777,test
observation3,0.788574,0,0,0.788574,test


### Missing Data 

We will now look at basic operations in pandas to deal with missing data. 

In [35]:
# Let's add some missing values
df.loc['observation3','var3'] = np.nan
df.loc['observation2','var4'] = np.nan
df

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1.0,-1.0,-2.033894,test
observation2,-0.166223,1.0,,0.833777,test
observation3,0.788574,,0.0,0.788574,test
observation4,-1.210072,1.0,0.0,-0.210072,test


In [36]:
# Which values are missing?
df.isnull()

Unnamed: 0,var1,var3,var4,var5,var6
observation1,False,False,False,False,False
observation2,False,False,True,False,False
observation3,False,True,False,False,False
observation4,False,False,False,False,False


In [37]:
# Remove all rows in which var1 or var3 is missing
df.dropna(subset=['var1', 'var3'])

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1.0,-1.0,-2.033894,test
observation2,-0.166223,1.0,,0.833777,test
observation4,-1.210072,1.0,0.0,-0.210072,test


In [38]:
# Remove all columns that contain missing (np.nan) values
df.dropna(axis=1)

Unnamed: 0,var1,var5,var6
observation1,-1.033894,-2.033894,test
observation2,-0.166223,0.833777,test
observation3,0.788574,0.788574,test
observation4,-1.210072,-0.210072,test


In [39]:
# Look at original dataframe (we didn't use inplace=True so the NaN values are still here)
df

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1.0,-1.0,-2.033894,test
observation2,-0.166223,1.0,,0.833777,test
observation3,0.788574,,0.0,0.788574,test
observation4,-1.210072,1.0,0.0,-0.210072,test


In [40]:
# Fill NaN with specific value
df.fillna(value=99)

Unnamed: 0,var1,var3,var4,var5,var6
observation1,-1.033894,-1.0,-1.0,-2.033894,test
observation2,-0.166223,1.0,99.0,0.833777,test
observation3,0.788574,99.0,0.0,0.788574,test
observation4,-1.210072,1.0,0.0,-0.210072,test


In [41]:
# Fill NaN with column mean
df['var3'].fillna(value=df['var3'].mean())

observation1   -1.000000
observation2    1.000000
observation3    0.333333
observation4    1.000000
Name: var3, dtype: float64

In [42]:
# Fill NaN with column mean for multiple columns
df[['var3','var4']].fillna(value=df[['var3','var4']].mean())

Unnamed: 0,var3,var4
observation1,-1.0,-1.0
observation2,1.0,-0.333333
observation3,0.333333,0.0
observation4,1.0,0.0
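
Like `drop`, `fillna` returns a new object, so the result has to be assigned to keep it. A minimal sketch, with a made-up `frame` that mirrors the two columns above, counts the missing values per column and keeps the imputed result:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'var3': [-1.0, 1.0, np.nan, 1.0],
                      'var4': [-1.0, np.nan, 0.0, 0.0]})

# Number of missing values in each column
missing_per_column = frame.isnull().sum()

# Fill each column with its own mean and keep the result
filled = frame.fillna(frame.mean())

print(missing_per_column.tolist())       # [1, 1]
print(filled.isnull().sum().sum())       # 0
print(frame.isnull().sum().sum())        # 2, the original still has its NaN values
```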


### Summary Part C (Modifying Data)

**We learned how to**:

- assign values
- check and modify column data types
- create new columns
- drop rows or columns
- check for missing values
- remove rows or columns with missing values
- impute missing values

**And made use of the following commands and methods**:
```
- df.loc[['observation2','observation4'],'var3'] = 1
- df.dtypes
- df['var1'].astype()
- df.drop('var2', axis=1)
- df.drop('observation4', axis=0)
- df.isnull()
- df.dropna()
- df.dropna(axis=1)
- df.fillna(value=99)
- df[['var3','var4']].fillna(value=df[['var3','var4']].mean())
```

## D: Grouped Operations


In [43]:
# Create a dataframe from a dictionary
data = {'Company': ['Google', 'Google', 'KLM', 'KLM', 'KLM', 'Ryanair', 'Ryanair'], 
        'Person': ['Saskia', 'Glenn', 'Niels', 'Martijn', 'Laura', 'Joris', 'Sara'],
       'Salary': [4000, 2400, 2800, 2800, 2600, 1600, 2100]}
df = pd.DataFrame(data)
df

Unnamed: 0,Company,Person,Salary
0,Google,Saskia,4000
1,Google,Glenn,2400
2,KLM,Niels,2800
3,KLM,Martijn,2800
4,KLM,Laura,2600
5,Ryanair,Joris,1600
6,Ryanair,Sara,2100


In [44]:
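# Take mean() of all numeric columns, after grouping by Company
# numeric_only=True skips the non-numeric Person column (needed in pandas 2.0 and later)
df.groupby('Company').mean(numeric_only=True)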

Company,Salary
Google,3200.0
KLM,2733.333333
Ryanair,1850.0


In [45]:
# Group by Company, select Salary column and calculate mean values
df.groupby('Company').Salary.mean()

Company
Google     3200.000000
KLM        2733.333333
Ryanair    1850.000000
Name: Salary, dtype: float64

In [46]:
# [Optional] More generic way, using lambda functions
# df.groupby('Company').apply(lambda x: x['Salary'].mean())
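
When several statistics per group are needed, `agg` can compute them in one call. The sketch below reuses the salary data from this section under the assumed name `salaries`.

```python
import pandas as pd

salaries = pd.DataFrame({
    'Company': ['Google', 'Google', 'KLM', 'KLM', 'KLM', 'Ryanair', 'Ryanair'],
    'Salary': [4000, 2400, 2800, 2800, 2600, 1600, 2100],
})

# One row per company, one column per aggregation function
summary = salaries.groupby('Company')['Salary'].agg(['mean', 'min', 'max', 'count'])

print(summary)
```

The result is a DataFrame indexed by `Company`, so a single statistic can be read with `summary.loc['KLM', 'max']`.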

### Summary Part D (Grouped Operations)

**We learned how to**:
- create a dataframe from a dictionary
- group by a specific variable and apply a function to its groups
- apply lambda functions

**And made use of the following commands and methods**:

```
- .groupby()
- .apply()
- lambda x: ..
```

## E: Long/Wide Conversions

In [47]:
# Create a dataframe from a dictionary
data = {'Company': ['Google', 'Google', 'KLM', 'KLM', 'KLM', 'Ryanair', 'Ryanair'], 
        'Person': ['Saskia', 'Glenn', 'Niels', 'Martijn', 'Laura', 'Joris', 'Sara'],
       'Salary': [4000, 2400, 2800, 2800, 2600, 1600, 2100]}
df = pd.DataFrame(data)

# Data is in long format (usually preferred for data handling)
df

Unnamed: 0,Company,Person,Salary
0,Google,Saskia,4000
1,Google,Glenn,2400
2,KLM,Niels,2800
3,KLM,Martijn,2800
4,KLM,Laura,2600
5,Ryanair,Joris,1600
6,Ryanair,Sara,2100


In [48]:
# Pivoting to wide format (often used in reporting)
df_wide = df.pivot(index='Company', columns='Person', values='Salary')
df_wide

Company,Glenn,Joris,Laura,Martijn,Niels,Sara,Saskia
Google,2400.0,,,,,,4000.0
KLM,,,2600.0,2800.0,2800.0,,
Ryanair,,1600.0,,,,2100.0,


In [49]:
# Melting back to long format; Company/Person pairs that did not exist in the original data appear as NaN
df_wide.reset_index() \
    .melt(id_vars='Company', var_name='Person', value_name='Salary')

Unnamed: 0,Company,Person,Salary
0,Google,Glenn,2400.0
1,KLM,Glenn,
2,Ryanair,Glenn,
3,Google,Joris,
4,KLM,Joris,
5,Ryanair,Joris,1600.0
6,Google,Laura,
7,KLM,Laura,2600.0
8,Ryanair,Laura,
9,Google,Martijn,
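
The melted result contains a row for every Company/Person combination, most of them empty. As a sketch of one way to get back to the original seven rows, the NaN rows can be dropped after melting. The names `long_df`, `wide` and `back` are assumptions for this example.

```python
import pandas as pd

long_df = pd.DataFrame({
    'Company': ['Google', 'Google', 'KLM', 'KLM', 'KLM', 'Ryanair', 'Ryanair'],
    'Person': ['Saskia', 'Glenn', 'Niels', 'Martijn', 'Laura', 'Joris', 'Sara'],
    'Salary': [4000, 2400, 2800, 2800, 2600, 1600, 2100],
})

# Long to wide: one row per company, one column per person
wide = long_df.pivot(index='Company', columns='Person', values='Salary')

# Wide to long, then drop the combinations that never existed
back = (wide.reset_index()
            .melt(id_vars='Company', var_name='Person', value_name='Salary')
            .dropna(subset=['Salary']))

print(wide.shape)   # (3, 7)
print(len(back))    # 7
```

Note that `Salary` comes back as floating point, because the wide table had to hold NaN values.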


### Summary Part E (Long/Wide Conversions)

**We learned how to**:
- convert a dataframe in long format to wide format
- convert a dataframe in wide format to long format


**And made use of the following commands and methods**:

```
- .pivot()
- .melt()
```

## F: Combining DataFrames

There are three different ways of combining DataFrames in pandas:

1. merging
2. joining
3. concatenating

of which the `merge` command is the most important.

### Merge 

`pd.merge` combines two DataFrames by matching rows on one or more key columns. Its two most useful parameters are:

> `on`: the column or columns on which to join the DataFrames. If omitted, pandas joins on all columns the two DataFrames have in common.

> `how`: the type of join. The default is 'inner'; the options are:
    - left: Use keys from the left frame only
    - right: Use keys from the right frame only
    - outer: Use union of keys from both frames
    - inner: Use intersection of keys from both frames
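
To see the effect of each `how` option side by side, the sketch below merges two tiny made-up DataFrames (`left` and `right`) and records which keys survive. The `indicator=True` option, not otherwise used in this notebook, adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_value': [20, 30, 40]})

# Collect the keys that remain for every join type
keys_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='key', how=how)
    keys_by_how[how] = sorted(merged['key'])
    print(how, keys_by_how[how])

# indicator=True labels each row as left_only, right_only or both
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
origin = dict(zip(outer['key'], outer['_merge'].astype(str)))
print(origin)
```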

In [50]:
# Example dataframes
df_salaries  = pd.DataFrame({'Company': ['Google', 'Google', 'KLM', 'KLM', 'KLM', 'Ryanair', 'Ryanair'], 
                             'Person': ['Saskia', 'Glenn', 'Niels', 'Martijn', 'Laura', 'Joris', 'Sara'],
                             'Salary': [4000, 2400, 2800, 2800, 2600, 1600, 2100]},
                           columns=['Person', 'Company', 'Salary'])

df_personal = pd.DataFrame({'Person': ['Sara', 'Laura', 'Glenn', 'Roy'],
                            'Age': [25, 29, 41, 30],
                            'Country': ['Belgium', 'Australia', 'United States', 'Trinidad and Tobago']},
                          columns=['Person', 'Age', 'Country'])

In [51]:
df_salaries

Unnamed: 0,Person,Company,Salary
0,Saskia,Google,4000
1,Glenn,Google,2400
2,Niels,KLM,2800
3,Martijn,KLM,2800
4,Laura,KLM,2600
5,Joris,Ryanair,1600
6,Sara,Ryanair,2100


In [52]:
df_personal

Unnamed: 0,Person,Age,Country
0,Sara,25,Belgium
1,Laura,29,Australia
2,Glenn,41,United States
3,Roy,30,Trinidad and Tobago


In [53]:
# Without further arguments: inner join on the column both DataFrames share (Person)
pd.merge(df_salaries, df_personal)

Unnamed: 0,Person,Company,Salary,Age,Country
0,Glenn,Google,2400,41,United States
1,Laura,KLM,2600,29,Australia
2,Sara,Ryanair,2100,25,Belgium


In [54]:
# Outer join: keep every Person from both DataFrames; missing values become NaN
pd.merge(df_salaries, df_personal, how='outer', on=['Person'])

Unnamed: 0,Person,Company,Salary,Age,Country
0,Saskia,Google,4000.0,,
1,Glenn,Google,2400.0,41.0,United States
2,Niels,KLM,2800.0,,
3,Martijn,KLM,2800.0,,
4,Laura,KLM,2600.0,29.0,Australia
5,Joris,Ryanair,1600.0,,
6,Sara,Ryanair,2100.0,25.0,Belgium
7,Roy,,,30.0,Trinidad and Tobago


In [55]:
# Right join: keep every Person from df_personal
pd.merge(df_salaries, df_personal, how='right', on=['Person'])

Unnamed: 0,Person,Company,Salary,Age,Country
0,Glenn,Google,2400.0,41,United States
1,Laura,KLM,2600.0,29,Australia
2,Sara,Ryanair,2100.0,25,Belgium
3,Roy,,,30,Trinidad and Tobago


### Joining

`join` combines the columns of two DataFrames by matching rows on their **index**. By default it performs a left join (`how='left'`).

In [56]:
# Example dataframes
df_salaries  = pd.DataFrame({'Company': ['Google', 'Google', 'KLM', 'KLM', 'KLM', 'Ryanair', 'Ryanair'], 
                             'Person': ['Saskia', 'Glenn', 'Niels', 'Martijn', 'Laura', 'Joris', 'Sara'],
                             'Salary': [4000, 2400, 2800, 2800, 2600, 1600, 2100]},
                           columns=['Person', 'Company', 'Salary'])

df_personal = pd.DataFrame({'Person': ['Sara', 'Laura', 'Glenn', 'Roy'],
                            'Age': [25, 29, 41, 30],
                            'Country': ['Belgium', 'Australia', 'United States', 'Trinidad and Tobago']},
                          columns=['Person', 'Country', 'Age'])

# Set index to 'Person'
df_salaries.set_index('Person', inplace=True)
df_personal.set_index('Person', inplace=True)

In [57]:
df_salaries

Person,Company,Salary
Saskia,Google,4000
Glenn,Google,2400
Niels,KLM,2800
Martijn,KLM,2800
Laura,KLM,2600
Joris,Ryanair,1600
Sara,Ryanair,2100


In [58]:
df_personal

Person,Country,Age
Sara,Belgium,25
Laura,Australia,29
Glenn,United States,41
Roy,Trinidad and Tobago,30


In [59]:
df_salaries.join(df_personal, how='inner')

Person,Company,Salary,Country,Age
Glenn,Google,2400,United States,41
Laura,KLM,2600,Australia,29
Sara,Ryanair,2100,Belgium,25
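
The cell above passes `how='inner'` explicitly. For comparison, this sketch shows what `join` does when `how` is left out, using two small made-up DataFrames indexed by person name.

```python
import pandas as pd

salaries = pd.DataFrame({'Salary': [4000, 2400, 2600]},
                        index=['Saskia', 'Glenn', 'Laura'])
personal = pd.DataFrame({'Country': ['Australia', 'United States', 'Trinidad and Tobago']},
                        index=['Laura', 'Glenn', 'Roy'])

# Default (how='left'): every row of salaries is kept, unmatched ones get NaN
left_join = salaries.join(personal)

# how='inner': only index labels present in both DataFrames
inner_join = salaries.join(personal, how='inner')

print(left_join)
print(inner_join)
```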


### Concatenating

`pd.concat` glues DataFrames together along an axis; by default (`axis=0`) it stacks them row-wise. The DataFrames should have the same columns: with the default `join='outer'`, a column missing from one of them is filled with NaN.

In [60]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2'],}, 
                   index= [0,1,2])

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5'],
                    'C': ['C3', 'C4', 'C5'],}, 
                   index= [4,5,6])

df3 = pd.DataFrame({'A': ['A6', 'A7', 'A8'],
                    'B': ['B6', 'B7', 'B8'],
                    'C': ['C6', 'C7', 'C8'],}, 
                   index= [7,8,9])

In [61]:
pd.concat([df1,df2,df3])

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
4,A3,B3,C3
5,A4,B4,C4
6,A5,B5,C5
7,A6,B6,C6
8,A7,B7,C7
9,A8,B8,C8
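
A few `pd.concat` options are worth knowing when the inputs do not line up as neatly as above. The sketch below uses two made-up DataFrames, `top` and `bottom`, that share only column `A` and have overlapping index labels.

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'C': ['C2', 'C3']}, index=[0, 1])

# Default: rows are stacked, index labels are kept, missing columns become NaN
stacked = pd.concat([top, bottom])

# ignore_index=True replaces the index with 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)

# join='inner' keeps only the columns present in every DataFrame
shared_only = pd.concat([top, bottom], join='inner')

# axis=1 places the DataFrames side by side, aligned on the index
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.index.tolist())       # [0, 1, 0, 1]
print(renumbered.index.tolist())    # [0, 1, 2, 3]
print(list(shared_only.columns))    # ['A']
print(side_by_side.shape)           # (2, 4)
```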


### Summary Part F (Combining DataFrames)

**We learned**:
- how to merge, join and concatenate DataFrames
- how inner, outer, left and right joins work

**And made use of the following commands and methods**:

```
- pd.merge()
- df1.join(df2)
- pd.concat()
```

# Conclusion

**This concludes our discussion of:**
- A. Creating DataFrames
- B. Selecting and Sorting
- C. Modifying Data
- D. Grouped Operations
- E. Long/Wide Conversions
- F. Combining DataFrames

**Next**: Exercises can be found in the same directory as this notebook.


# References

- Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/
- Python for Data Analysis (Wes McKinney): https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1449319793