# Extracting and Transforming Data

In this chapter, you will learn how to index, slice, filter, and transform DataFrames, using datasets that range from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.

- Indexing DataFrames
- Index ordering
- Positional and labeled indexing
- Slicing DataFrames
- Filtering DataFrames
- Filtering using NaNs
- Transforming DataFrames
- Using apply() to transform a column
- Using map() with a dictionary
- Using vectorized functions (built-in UFuncs)

## Indexing Dataframes



In [64]:
import numpy as np

In [49]:
import pandas as pd

df = pd.read_csv('./data/sales.csv', index_col='month')

In [50]:
df

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 205 | 60.0 | 55 |


In [51]:
# Using Square Brackets

df['salt']['Jan']

12.0

In [52]:
# Using Column attribute and row label

df.eggs['Mar']

221

In [53]:
# using .loc accessor

df.loc['May', 'spam']

52

In [54]:
# Using .iloc

df.iloc[4, 2]

52
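
The four access styles above all return a single value. As a minimal sketch, the block below rebuilds the sales table inline (an assumption, so the example runs without `./data/sales.csv`) and retrieves values with each style.

```python
import numpy as np
import pandas as pd

# Inline copy of the sales table shown above (stands in for sales.csv)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205],
     "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
     "spam": [17, 31, 72, 20, 52, 55]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

by_brackets = df["salt"]["Jan"]   # select the column, then the row label
by_attribute = df.eggs["Mar"]     # attribute access needs a valid Python identifier
by_loc = df.loc["May", "spam"]    # row label, column label
by_iloc = df.iloc[4, 2]           # row position, column position (zero-based)

print(by_brackets, by_attribute, by_loc, by_iloc)
```

`.loc` and `.iloc` address the row and the column in a single call, which makes them the safer choice when the same expression is later used to assign values.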

#### Indexing and column rearrangement

It is sometimes useful to change the order of a DataFrame's columns. Passing a list of column names in square brackets returns a new DataFrame that contains only those columns, in the order listed.

The original exercise reads the Pennsylvania election results CSV file (its path is provided in the variable filename), sets the index to 'county', and assigns a new DataFrame by selecting the list of columns ['winner', 'total', 'voters']. The cell below applies the same technique to the sales DataFrame, selecting 'salt' and 'eggs'.

In [13]:
# Selecting only some columns

df_new = df[['salt', 'eggs']]

df_new

| month | salt | eggs |
|---|---|---|
| Jan | 12.0 | 47 |
| Feb | 50.0 | 110 |
| Mar | 89.0 | 221 |
| Apr | 87.0 | 77 |
| May | NaN | 132 |
| Jun | 60.0 | 205 |
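
The election exercise described above could look roughly like the sketch below. The inline frame is an assumption that stands in for the CSV file; it holds the first three rows of the election data as printed later in this chapter.

```python
import pandas as pd

# Stand-in for the election CSV: first three counties only
election = pd.DataFrame(
    {"county": ["Adams", "Allegheny", "Armstrong"],
     "state": ["PA", "PA", "PA"],
     "total": [41973, 614671, 28322],
     "winner": ["Romney", "Obama", "Romney"],
     "voters": [61156, 924351, 42147]}
).set_index("county")

# A list of names selects those columns in the order listed
results = election[["winner", "total", "voters"]]
print(results)
```

When reading from a file, `pd.read_csv(filename, index_col='county')` sets the index in the same step as loading.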


## Slicing DataFrames



In [14]:
df

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 205 | 60.0 | 55 |


### Using Slicers

In [15]:
# returns Series

df['eggs']

month
Jan     47
Feb    110
Mar    221
Apr     77
May    132
Jun    205
Name: eggs, dtype: int64

In [16]:
# positional slicing

df['eggs'][1:4]

month
Feb    110
Mar    221
Apr     77
Name: eggs, dtype: int64

In [17]:
# single value by position; .iloc avoids the deprecated integer lookup on a labeled index
df['eggs'].iloc[4]

132

In [18]:
# slicing columns by label using .loc

df.loc[:, 'eggs':'salt']

| month | eggs | salt |
|---|---|---|
| Jan | 47 | 12.0 |
| Feb | 110 | 50.0 |
| Mar | 221 | 89.0 |
| Apr | 77 | 87.0 |
| May | 132 | NaN |
| Jun | 205 | 60.0 |


In [19]:
df.loc[:, ['eggs', 'salt']]

| month | eggs | salt |
|---|---|---|
| Jan | 47 | 12.0 |
| Feb | 110 | 50.0 |
| Mar | 221 | 89.0 |
| Apr | 77 | 87.0 |
| May | 132 | NaN |
| Jun | 205 | 60.0 |


In [20]:
df.loc['Jan':'Apr', :]

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |


In [21]:
df.loc['Mar':'May', 'salt':'spam']

| month | salt | spam |
|---|---|---|
| Mar | 89.0 | 72 |
| Apr | 87.0 | 20 |
| May | NaN | 52 |


In [22]:
# using .iloc

df.iloc[2:5, 1:]

| month | salt | spam |
|---|---|---|
| Mar | 89.0 | 72 |
| Apr | 87.0 | 20 |
| May | NaN | 52 |
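
The last two cells return the same rows and columns because label slices include both endpoints while position slices exclude the stop index. A small sketch that checks this, with the sales table rebuilt inline as an assumption:

```python
import numpy as np
import pandas as pd

# Inline copy of the sales table (stands in for sales.csv)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205],
     "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
     "spam": [17, 31, 72, 20, 52, 55]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

by_label = df.loc["Mar":"May", "salt":"spam"]  # 'May' and 'spam' are included
by_position = df.iloc[2:5, 1:]                 # row 5 is excluded

# equals() treats NaNs in the same position as equal
same = by_label.equals(by_position)
print(same)
```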


### Using Lists

In [23]:
df.loc['Jan':'May', ['eggs', 'spam']]

| month | eggs | spam |
|---|---|---|
| Jan | 47 | 17 |
| Feb | 110 | 31 |
| Mar | 221 | 72 |
| Apr | 77 | 20 |
| May | 132 | 52 |


In [24]:
# using .iloc with a list of 3 row positions
# remember, 0:2 selects only 2 columns (positions 0 and 1)

df.iloc[[0,4,5], 0:2]

| month | eggs | salt |
|---|---|---|
| Jan | 47 | 12.0 |
| May | 132 | NaN |
| Jun | 205 | 60.0 |


### Series vs. 1 Column Dataframe

In [25]:
# yields a Series

df['eggs']

month
Jan     47
Feb    110
Mar    221
Apr     77
May    132
Jun    205
Name: eggs, dtype: int64

In [26]:
# returns a DataFrame with a single column

df[['eggs']]

| month | eggs |
|---|---|
| Jan | 47 |
| Feb | 110 |
| Mar | 221 |
| Apr | 77 |
| May | 132 |
| Jun | 205 |
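
The difference between the two selections shows up in the type and shape of the result. A minimal sketch, using an inline stand-in for the eggs column:

```python
import pandas as pd

# Inline stand-in for the sales data (eggs column only)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

as_series = df["eggs"]    # single brackets with a string: one-dimensional Series
as_frame = df[["eggs"]]   # a list of names: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The one-column DataFrame is useful when later code expects DataFrame methods or a two-dimensional shape.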


## Filtering

- Creating Boolean Series
- Combining Filters using & (and), | (or), ~ (not)

In [27]:
# creating Boolean Series

df.salt>60

month
Jan    False
Feb    False
Mar     True
Apr     True
May    False
Jun    False
Name: salt, dtype: bool

In [28]:
# Using Boolean Series as a filter is very powerful

df[df.salt>60]

| month | eggs | salt | spam |
|---|---|---|---|
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |


In [29]:
enough_salt_sold = df.salt>60

df[enough_salt_sold]

| month | eggs | salt | spam |
|---|---|---|---|
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |


#### Combining Filters

In [30]:
df[(df.salt>60) & (df.eggs<200)]

| month | eggs | salt | spam |
|---|---|---|---|
| Apr | 77 | 87.0 | 20 |


In [31]:
df[(df.salt>=50) | (df.eggs<200)]

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 205 | 60.0 | 55 |
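
The sketch below repeats the combined filters on an inline copy of the sales table (an assumption, so it runs on its own) and adds negation with `~`. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `<`.

```python
import numpy as np
import pandas as pd

# Inline copy of the sales table (stands in for sales.csv)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205],
     "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
     "spam": [17, 31, 72, 20, 52, 55]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

high_salt = df["salt"] > 60    # NaN compares as False, so May is False here
few_eggs = df["eggs"] < 200

both = df[high_salt & few_eggs]               # rows where both conditions hold
either = df[(df["salt"] >= 50) | few_eggs]    # rows where at least one holds
not_high = df[~high_salt]                     # rows where the first condition fails

print(list(both.index), list(not_high.index))
```

Because a comparison with NaN is False, May appears in `not_high` even though its salt value is missing rather than low.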


#### Dataframes with zeros and NaNs

In [32]:
df2 = df.copy()

df2['bacon'] = [0,0, 50, 60, 70, 80]

df2

| month | eggs | salt | spam | bacon |
|---|---|---|---|---|
| Jan | 47 | 12.0 | 17 | 0 |
| Feb | 110 | 50.0 | 31 | 0 |
| Mar | 221 | 89.0 | 72 | 50 |
| Apr | 77 | 87.0 | 20 | 60 |
| May | 132 | NaN | 52 | 70 |
| Jun | 205 | 60.0 | 55 | 80 |


In [33]:
# Select the columns in which every value is nonzero
# The bacon column is excluded because it contains zeros
df2.loc[:, df2.all()]

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 205 | 60.0 | 55 |


In [34]:
# Select the columns that contain at least one nonzero value
# The whole DataFrame is returned because no column is entirely zero

df2.loc[:, df2.any()]

| month | eggs | salt | spam | bacon |
|---|---|---|---|---|
| Jan | 47 | 12.0 | 17 | 0 |
| Feb | 110 | 50.0 | 31 | 0 |
| Mar | 221 | 89.0 | 72 | 50 |
| Apr | 77 | 87.0 | 20 | 60 |
| May | 132 | NaN | 52 | 70 |
| Jun | 205 | 60.0 | 55 | 80 |


In [35]:
# Select the columns that contain at least one NaN

df.loc[:, df.isnull().any()]

| month | salt |
|---|---|
| Jan | 12.0 |
| Feb | 50.0 |
| Mar | 89.0 |
| Apr | 87.0 |
| May | NaN |
| Jun | 60.0 |


In [36]:
# Select the columns that contain no NaN values

df.loc[:, df.notnull().all()]

| month | eggs | spam |
|---|---|---|
| Jan | 47 | 17 |
| Feb | 110 | 31 |
| Mar | 221 | 72 |
| Apr | 77 | 20 |
| May | 132 | 52 |
| Jun | 205 | 55 |
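
In each of these cells a reduction such as `.all()` or `.any()` produces one Boolean per column, and `.loc[:, mask]` keeps the columns marked True. A sketch of the four selections on inline copies of the data (an assumption, so the example runs on its own):

```python
import numpy as np
import pandas as pd

# Inline copy of the sales table (stands in for sales.csv)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205],
     "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
     "spam": [17, 31, 72, 20, 52, 55]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)
df2 = df.copy()
df2["bacon"] = [0, 0, 50, 60, 70, 80]

all_nonzero = df2.loc[:, df2.all()]           # every value nonzero (NaN is skipped)
any_nonzero = df2.loc[:, df2.any()]           # at least one nonzero value
with_nan = df.loc[:, df.isnull().any()]       # at least one missing value
without_nan = df.loc[:, df.notnull().all()]   # no missing values

print(list(all_nonzero.columns), list(with_nan.columns))
```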


#### Drop rows with any NaNs

In [37]:
# notice that May is dropped because it contains a NaN

df.dropna(how='any')

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| Jun | 205 | 60.0 | 55 |


In [38]:
# notice that May is kept with how='all' because not every value in the row is NaN

df.dropna(how='all')

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 205 | 60.0 | 55 |


## Modifying a column based on another

In [39]:
# Add 5 to the eggs column in rows where salt > 55.
# Chained indexing like this triggers the SettingWithCopyWarning shown below;
# df.loc[df.salt > 55, 'eggs'] += 5 performs the same update in a single step.

df.eggs[df.salt>55] += 5

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [40]:
df

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 226 | 89.0 | 72 |
| Apr | 82 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 210 | 60.0 | 55 |
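
A conditional update of this kind can be written so that selection and assignment happen in one `.loc` call, which avoids assigning into an intermediate object that may be a copy. A minimal sketch on an inline copy of the sales table (the name `sales` is an assumption used to keep the original values intact):

```python
import numpy as np
import pandas as pd

# Inline copy of the sales table (stands in for sales.csv)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205],
     "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
     "spam": [17, 31, 72, 20, 52, 55]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

sales = df.copy()  # work on a copy so df keeps its original values

# Row mask and column label go into a single .loc indexer
sales.loc[sales["salt"] > 55, "eggs"] += 5

print(sales["eggs"].tolist())
```

Only Mar, Apr, and Jun change; May is untouched because a comparison with NaN is False.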


In [45]:
titanic = pd.read_csv('./data/titanic.csv')

In [48]:
# Select the 'age' and 'cabin' columns: df_titanic
df_titanic = titanic.loc[:, ['age', 'cabin']]

# Print the shape of df_titanic
print(df_titanic.shape)

# Drop rows containing any missing value (how='any') and print the shape
print(df_titanic.dropna(how='any').shape)

# Drop rows in which every value is missing (how='all') and print the shape
print(df_titanic.dropna(how='all').shape)

# Drop columns with fewer than 1000 non-missing values.
# .info() prints its report and returns None, so print() also emits "None".
print(df_titanic.dropna(thresh=1000, axis='columns').info())

(1309, 2)
(272, 2)
(1069, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 1 columns):
age    1046 non-null float64
dtypes: float64(1)
memory usage: 10.3 KB
None
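
The three `dropna` variants are easier to follow on a frame small enough to check by eye. The data below is made up for illustration and is not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame: 'age' has 3 non-missing values, 'cabin' has 1
people = pd.DataFrame(
    {"age": [29.0, np.nan, 2.0, 30.0],
     "cabin": ["B5", np.nan, np.nan, np.nan]}
)

complete_rows = people.dropna(how="any")     # keep rows with no missing values
non_empty_rows = people.dropna(how="all")    # drop rows where everything is missing
dense_columns = people.dropna(thresh=2, axis="columns")  # keep columns with >= 2 values

print(complete_rows.shape, non_empty_rows.shape, list(dense_columns.columns))
```

`thresh` counts non-missing values, so `thresh=2` keeps 'age' (three values) and drops 'cabin' (one value).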


## Transforming DataFrames

Once you have filtered your data, you will usually want to transform it.

NumPy universal functions (ufuncs) transform entire columns of data elementwise.

They let you transform a whole DataFrame in one vectorized operation, without writing explicit loops.

#### DataFrame vectorized methods

In [60]:
df

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 47 | 12.0 | 17 |
| Feb | 110 | 50.0 | 31 |
| Mar | 221 | 89.0 | 72 |
| Apr | 77 | 87.0 | 20 |
| May | 132 | NaN | 52 |
| Jun | 205 | 60.0 | 55 |


Suppose we want to convert sales numbers into units of whole dozens (rounded down).

- Use the floordiv() method from pandas
- Use NumPy's floor_divide function

In [62]:
# Convert to dozens unit using pandas

df.floordiv(12)

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 3 | 1.0 | 1 |
| Feb | 9 | 4.0 | 2 |
| Mar | 18 | 7.0 | 6 |
| Apr | 6 | 7.0 | 1 |
| May | 11 | NaN | 4 |
| Jun | 17 | 5.0 | 4 |


In [65]:
# Convert to dozens unit using Numpy

np.floor_divide(df, 12)

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 3.0 | 1.0 | 1.0 |
| Feb | 9.0 | 4.0 | 2.0 |
| Mar | 18.0 | 7.0 | 6.0 |
| Apr | 6.0 | 7.0 | 1.0 |
| May | 11.0 | NaN | 4.0 |
| Jun | 17.0 | 5.0 | 4.0 |


In [67]:
# Create custom function using apply() method

def dozens(n):
    return n//12

df.apply(dozens)

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 3 | 1.0 | 1 |
| Feb | 9 | 4.0 | 2 |
| Mar | 18 | 7.0 | 6 |
| Apr | 6 | 7.0 | 1 |
| May | 11 | NaN | 4 |
| Jun | 17 | 5.0 | 4 |


In [69]:
# Use a lambda function within the apply() method
# A lambda is a convenient one-line function definition,
# suited to a throwaway function that is used only once

df.apply(lambda n: n//12)

| month | eggs | salt | spam |
|---|---|---|---|
| Jan | 3 | 1.0 | 1 |
| Feb | 9 | 4.0 | 2 |
| Mar | 18 | 7.0 | 6 |
| Apr | 6 | 7.0 | 1 |
| May | 11 | NaN | 4 |
| Jun | 17 | 5.0 | 4 |
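
The `apply()` results match the vectorized ones because, with the default axis, `DataFrame.apply` passes each column to the function as a whole Series, so `n // 12` is itself a vectorized operation on that column. The sketch below records what the function receives; the `received` list is a helper added only for illustration, and the table is rebuilt inline as an assumption.

```python
import numpy as np
import pandas as pd

# Inline copy of the sales table (stands in for sales.csv)
df = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205],
     "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
     "spam": [17, 31, 72, 20, 52, 55]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

received = []  # illustration only: type names of the arguments seen by dozens()

def dozens(n):
    received.append(type(n).__name__)
    return n // 12

result = df.apply(dozens)

print(set(received))
print(result["eggs"].tolist())
```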


#### Storing a transformation

Assign the transformed data to a new column to keep it alongside the original values.

In [71]:
df['dozens_of_eggs'] = df.eggs.floordiv(12)

df

| month | eggs | salt | spam | dozens_of_eggs |
|---|---|---|---|---|
| Jan | 47 | 12.0 | 17 | 3 |
| Feb | 110 | 50.0 | 31 | 9 |
| Mar | 221 | 89.0 | 72 | 18 |
| Apr | 77 | 87.0 | 20 | 6 |
| May | 132 | NaN | 52 | 11 |
| Jun | 205 | 60.0 | 55 | 17 |


## String operations

Use the handy .str accessor for string operations:

- str.upper()
- map(str.lower) for an index
- map(custom function) for an index

In [72]:
df

| month | eggs | salt | spam | dozens_of_eggs |
|---|---|---|---|---|
| Jan | 47 | 12.0 | 17 | 3 |
| Feb | 110 | 50.0 | 31 | 9 |
| Mar | 221 | 89.0 | 72 | 18 |
| Apr | 77 | 87.0 | 20 | 6 |
| May | 132 | NaN | 52 | 11 |
| Jun | 205 | 60.0 | 55 | 17 |


In [73]:
df.index

Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], dtype='object', name='month')

In [76]:
df.index = df.index.str.upper()

df

| month | eggs | salt | spam | dozens_of_eggs |
|---|---|---|---|---|
| JAN | 47 | 12.0 | 17 | 3 |
| FEB | 110 | 50.0 | 31 | 9 |
| MAR | 221 | 89.0 | 72 | 18 |
| APR | 77 | 87.0 | 20 | 6 |
| MAY | 132 | NaN | 52 | 11 |
| JUN | 205 | 60.0 | 55 | 17 |


In [79]:
# For index, there is no apply method.  
# Use map for custom transformations

df.index = df.index.map(str.lower)

df

| month | eggs | salt | spam | dozens_of_eggs |
|---|---|---|---|---|
| jan | 47 | 12.0 | 17 | 3 |
| feb | 110 | 50.0 | 31 | 9 |
| mar | 221 | 89.0 | 72 | 18 |
| apr | 77 | 87.0 | 20 | 6 |
| may | 132 | NaN | 52 | 11 |
| jun | 205 | 60.0 | 55 | 17 |
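
The three index transformations listed at the start of this section can be tried on a bare Index without the rest of the DataFrame. In the sketch below, `first_letter` is a hypothetical custom function added for illustration.

```python
import pandas as pd

months = pd.Index(["Jan", "Feb", "Mar"], name="month")

upper = months.str.upper()        # vectorized string method via the .str accessor
lower = upper.map(str.lower)      # map() calls the function on each label

def first_letter(label):
    """Hypothetical custom transformation: keep only the first character."""
    return label[:1]

initials = months.map(first_letter)

print(list(upper), list(lower), list(initials))
```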


In [81]:
df['salty_eggs'] = df.salt + df.dozens_of_eggs

df

| month | eggs | salt | spam | dozens_of_eggs | salty_eggs |
|---|---|---|---|---|---|
| jan | 47 | 12.0 | 17 | 3 | 15.0 |
| feb | 110 | 50.0 | 31 | 9 | 59.0 |
| mar | 221 | 89.0 | 72 | 18 | 107.0 |
| apr | 77 | 87.0 | 20 | 6 | 93.0 |
| may | 132 | NaN | 52 | 11 | NaN |
| jun | 205 | 60.0 | 55 | 17 | 77.0 |


#### Using apply() to transform a column

The .apply() method can be used on a pandas DataFrame to apply an arbitrary Python function along an axis; by default the function receives each column in turn as a Series. In this exercise you'll work with daily weather data for Pittsburgh in 2013, obtained from Weather Underground.

A function to convert degrees Fahrenheit to degrees Celsius has been written for you. Your job is to use the .apply() method to perform this conversion on the 'Mean TemperatureF' and 'Mean Dew PointF' columns of the weather DataFrame.



In [82]:
weather = pd.read_csv('./data/pittsburgh2013.csv')

In [84]:
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)

# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather.loc[:, ['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)

# Reassign the column names of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']

# Print the output of df_celsius.head()
print(df_celsius.head())

   Mean TemperatureC  Mean Dew PointC
0          -2.222222        -2.777778
1          -6.111111       -11.111111
2          -4.444444        -9.444444
3          -2.222222        -7.222222
4          -1.111111        -6.666667
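
Since `to_celsius` consists only of arithmetic, the same conversion can also be written directly on the selected columns. The sketch below compares the two routes on a small made-up frame; the Fahrenheit values are hypothetical and stand in for the weather file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather data (three days only)
weather = pd.DataFrame(
    {"Mean TemperatureF": [28, 21, 24],
     "Mean Dew PointF": [27, 12, 15]}
)
columns = ["Mean TemperatureF", "Mean Dew PointF"]

def to_celsius(F):
    return 5 / 9 * (F - 32)

via_apply = weather[columns].apply(to_celsius)     # function called once per column
via_arithmetic = (weather[columns] - 32) * 5 / 9   # plain DataFrame arithmetic

# Compare with a tolerance, since the operations run in a different order
same = np.allclose(via_apply, via_arithmetic)
print(same)
```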


#### Using .map() with a dictionary

The .map() method is used to transform values according to a Python dictionary look-up. In this exercise you'll practice this method while returning to working with the election DataFrame, which has been pre-loaded for you.

Your job is to use a dictionary to map the values 'Obama' and 'Romney' in the 'winner' column to the values 'blue' and 'red', and assign the output to the new column 'color'.



In [86]:
election = pd.read_csv('./data/pennsylvania2012_turnout.csv')

In [88]:
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama': 'blue', 'Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.loc[:, 'winner'].map(red_vs_blue)

# Print the output of election.head()
print(election.head())

      county state   total      Obama     Romney  winner  voters    turnout  \
0      Adams    PA   41973  35.482334  63.112001  Romney   61156  68.632677   
1  Allegheny    PA  614671  56.640219  42.185820   Obama  924351  66.497575   
2  Armstrong    PA   28322  30.696985  67.901278  Romney   42147  67.198140   
3     Beaver    PA   80015  46.032619  52.637630  Romney  115157  69.483401   
4    Bedford    PA   21444  22.057452  76.986570  Romney   32189  66.619031   

      margin color  
0  27.629667   red  
1  14.454399  blue  
2  37.204293   red  
3   6.605012   red  
4  54.929118   red  
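
One detail worth knowing about dictionary look-ups with `.map()`: a value that has no key in the dictionary becomes NaN in the result. The sketch below shows this on a short Series; the 'Other' entry is a made-up value added to trigger that case.

```python
import pandas as pd

# Short stand-in for the 'winner' column; 'Other' is hypothetical
winner = pd.Series(["Romney", "Obama", "Romney", "Other"], name="winner")

red_vs_blue = {"Obama": "blue", "Romney": "red"}

color = winner.map(red_vs_blue)   # keys missing from the dictionary map to NaN

print(color.tolist())
```

Checking `color.isnull().sum()` after such a mapping is a quick way to spot categories the dictionary does not cover.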


#### Using vectorized functions

When performance is paramount, you should avoid using .apply() and .map() because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, you can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

You can even write your own vectorized functions, but for now we will focus on the ones distributed by NumPy and pandas.

In this exercise you're going to import the zscore function from scipy.stats and use it to compute the deviation in voter turnout in Pennsylvania from the mean in fractions of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation is above the mean - so if it is negative, it means the observation is below the mean.

Instead of using .apply() as you did in the earlier exercises, you will pass a pandas Series directly to the vectorized zscore function, which returns a NumPy array. You will then assign the values of the NumPy array to a new column in the DataFrame. You will be working with the election DataFrame, which has been pre-loaded for you.

In [93]:
from scipy.stats import zscore

In [94]:
# Import zscore from scipy.stats
from scipy.stats import zscore

# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])

# Print the type of turnout_zscore
print(type(turnout_zscore))

# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore

# Print the output of election.head()
print(election.head())

<class 'numpy.ndarray'>
      county state   total      Obama     Romney  winner  voters    turnout  \
0      Adams    PA   41973  35.482334  63.112001  Romney   61156  68.632677   
1  Allegheny    PA  614671  56.640219  42.185820   Obama  924351  66.497575   
2  Armstrong    PA   28322  30.696985  67.901278  Romney   42147  67.198140   
3     Beaver    PA   80015  46.032619  52.637630  Romney  115157  69.483401   
4    Bedford    PA   21444  22.057452  76.986570  Romney   32189  66.619031   

      margin color  turnout_zscore  
0  27.629667   red        0.853734  
1  14.454399  blue        0.439846  
2  37.204293   red        0.575650  
3   6.605012   red        1.018647  
4  54.929118   red        0.463391  


Superb! Using vectorized functions like this fully leverages the power of pandas.
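
The z-score itself is a short vectorized expression: subtract the mean and divide by the standard deviation. The sketch below computes it with pandas alone on the five turnout values printed above. Because it uses only those five rows, its results differ from the `turnout_zscore` column, which is based on every county. `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, whereas pandas `.std()` defaults to `ddof=1`, so `ddof=0` is passed explicitly.

```python
import pandas as pd

# The five turnout values shown in the output above
turnout = pd.Series(
    [68.632677, 66.497575, 67.198140, 69.483401, 66.619031], name="turnout"
)

# z = (x - mean) / population standard deviation
z = (turnout - turnout.mean()) / turnout.std(ddof=0)

print(z.round(3).tolist())
```

By construction the resulting values have mean 0 and population standard deviation 1.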

