# Extracting and Transforming Data

In this chapter, you will learn how to index, slice, filter, and transform DataFrames using a variety of datasets, ranging from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.

## Indexing DataFrames



In [2]:
import pandas as pd

df = pd.read_csv('./data/sales.csv', index_col='month')

In [6]:
df

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55


In [5]:
# Using Square Brackets

df['salt']['Jan']

12.0

In [9]:
# Using Column attribute and row label

df.eggs['Mar']

221

In [11]:
# using .loc accessor

df.loc['May', 'spam']

52

In [13]:
# Using .iloc

df.iloc[4, 2]

52
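The four access styles above all reach the same cell. The following is a minimal, self-contained sketch that rebuilds the sales table by hand (the name `sales` is an assumption used here in place of the `df` loaded from `./data/sales.csv`) and checks that they agree:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sales table shown above (assumed stand-in for the CSV)
sales = pd.DataFrame(
    {
        "eggs": [47, 110, 221, 77, 132, 205],
        "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        "spam": [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

by_brackets = sales["spam"]["May"]     # column first, then row label
by_attribute = sales.spam["May"]       # column as attribute, then row label
by_loc = sales.loc["May", "spam"]      # row label, column label
by_iloc = sales.iloc[4, 2]             # row position, column position

print(by_brackets, by_attribute, by_loc, by_iloc)  # 52 52 52 52
```

Of the four, `.loc` and `.iloc` address the row and column in a single call, which is generally the safer habit once assignment is involved.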

#### Indexing and column rearrangement

There are circumstances in which it's useful to modify the order of your DataFrame columns. Selecting with a list of column names returns a new DataFrame whose columns appear in the order given in the list.

The exercise applies this to the Pennsylvania election results DataFrame: read the CSV file (its path is provided in the variable filename), set the index to 'county', and assign a new DataFrame by selecting the list of columns ['winner', 'total', 'voters']. The cell below demonstrates the same technique on the sales DataFrame.

In [15]:
# Selecting only some columns

df_new = df[['salt', 'eggs']]

df_new

month,salt,eggs
Jan,12.0,47
Feb,50.0,110
Mar,89.0,221
Apr,87.0,77
May,,132
Jun,60.0,205
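The election exercise described above can be sketched as follows. The CSV text, county names, and numbers here are made up for illustration, and `io.StringIO` stands in for the real file referenced by `filename`; only the column names 'county', 'winner', 'total', and 'voters' come from the exercise text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the election CSV; all values are invented
csv_text = """county,state,total,voters,winner
County A,PA,100,150,Candidate X
County B,PA,200,260,Candidate Y
"""

# Read the data and use the 'county' column as the index
election = pd.read_csv(io.StringIO(csv_text), index_col="county")

# Selecting with a list both filters and reorders the columns
results = election[["winner", "total", "voters"]]

print(results.columns.tolist())  # ['winner', 'total', 'voters']
```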


## Slicing DataFrames



In [16]:
df

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55


### Using Slicers

In [18]:
# returns Series

df['eggs']

month
Jan     47
Feb    110
Mar    221
Apr     77
May    132
Jun    205
Name: eggs, dtype: int64

In [20]:
# positional slicing

df['eggs'][1:4]

month
Feb    110
Mar    221
Apr     77
Name: eggs, dtype: int64

In [21]:
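# use .iloc for a single position; plain [4] on a label-indexed Series is deprecated in recent pandas

df['eggs'].iloc[4]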

132

In [22]:
# slicing using .loc

df.loc[:, 'eggs':'salt']

month,eggs,salt
Jan,47,12.0
Feb,110,50.0
Mar,221,89.0
Apr,77,87.0
May,132,
Jun,205,60.0


In [23]:
df.loc[:, ['eggs', 'salt']]

month,eggs,salt
Jan,47,12.0
Feb,110,50.0
Mar,221,89.0
Apr,77,87.0
May,132,
Jun,205,60.0


In [24]:
df.loc['Jan':'Apr', :]

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20


In [25]:
df.loc['Mar':'May', 'salt':'spam']

month,salt,spam
Mar,89.0,72
Apr,87.0,20
May,,52


In [26]:
# using .iloc

df.iloc[2:5, 1:]

month,salt,spam
Mar,89.0,72
Apr,87.0,20
May,,52
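The last two cells return the same rows and columns even though the slice bounds look different: label slices with `.loc` include both endpoints, while position slices with `.iloc` exclude the end position. A minimal sketch, again using a hand-built `sales` table as an assumed stand-in for `df`:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sales table (assumed stand-in for the CSV)
sales = pd.DataFrame(
    {
        "eggs": [47, 110, 221, 77, 132, 205],
        "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        "spam": [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

# Label-based: 'May' and 'spam' are both included in the result
by_label = sales.loc["Mar":"May", "salt":"spam"]

# Position-based: row 5 and column 3 are excluded from the result
by_position = sales.iloc[2:5, 1:3]

# DataFrame.equals treats NaNs in the same location as equal
print(by_label.equals(by_position))  # True
```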


### Using Lists

In [27]:
df.loc['Jan':'May', ['eggs', 'spam']]

month,eggs,spam
Jan,47,17
Feb,110,31
Mar,221,72
Apr,77,20
May,132,52


In [29]:
# using .iloc with a list of 3 row positions
# remember, 0:2 selects only the first 2 columns (the end position is excluded)

df.iloc[[0,4,5], 0:2]

month,eggs,salt
Jan,47,12.0
May,132,
Jun,205,60.0


### Series vs. One-Column DataFrame

In [30]:
# yields a Series

df['eggs']

month
Jan     47
Feb    110
Mar    221
Apr     77
May    132
Jun    205
Name: eggs, dtype: int64

In [32]:
# returns a DataFrame with a single column

df[['eggs']]

month,eggs
Jan,47
Feb,110
Mar,221
Apr,77
May,132
Jun,205
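The difference between the two results is easiest to see by inspecting their types and shapes. A short sketch on a hand-built `sales` table (an assumed stand-in for `df`):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sales table (assumed stand-in for the CSV)
sales = pd.DataFrame(
    {
        "eggs": [47, 110, 221, 77, 132, 205],
        "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        "spam": [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

as_series = sales["eggs"]     # single label -> one-dimensional Series
as_frame = sales[["eggs"]]    # list of labels -> two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (6,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (6, 1)
```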


## Filtering

- Creating Boolean Series
- Combining Filters using & (and), | (or), ~ (not)

In [33]:
# creating Boolean Series

df.salt>60

month
Jan    False
Feb    False
Mar     True
Apr     True
May    False
Jun    False
Name: salt, dtype: bool

In [36]:
# Using Boolean Series as a filter is very powerful

df[df.salt>60]

month,eggs,salt,spam
Mar,221,89.0,72
Apr,77,87.0,20


In [37]:
enough_salt_sold = df.salt>60

df[enough_salt_sold]

month,eggs,salt,spam
Mar,221,89.0,72
Apr,77,87.0,20


#### Combining Filters

In [38]:
df[(df.salt>60) & (df.eggs<200)]

month,eggs,salt,spam
Apr,77,87.0,20


In [39]:
df[(df.salt>=50) | (df.eggs<200)]

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55
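The cells above cover `&` and `|`. Negation with `~` works the same way on a Boolean Series; the sketch below uses a hand-built `sales` table (an assumed stand-in for `df`) and also shows how a missing value behaves: a comparison against NaN is False, so May lands in the negated filter.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sales table (assumed stand-in for the CSV)
sales = pd.DataFrame(
    {
        "eggs": [47, 110, 221, 77, 132, 205],
        "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        "spam": [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

high_salt = sales["salt"] > 60        # NaN > 60 evaluates to False for May

# ~ flips every Boolean, so May (False above) becomes True here
not_high_salt = sales[~high_salt]

print(not_high_salt.index.tolist())   # ['Jan', 'Feb', 'May', 'Jun']
```

Parentheses around each comparison matter when combining filters, because `&`, `|`, and `~` bind more tightly than `>` or `<` in Python.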


#### Dataframes with zeros and NaNs

In [41]:
df2 = df.copy()

df2['bacon'] = [0,0, 50, 60, 70, 80]

df2

month,eggs,salt,spam,bacon
Jan,47,12.0,17,0
Feb,110,50.0,31,0
Mar,221,89.0,72,50
Apr,77,87.0,20,60
May,132,,52,70
Jun,205,60.0,55,80


In [46]:
# Select the columns in which every value is nonzero
# bacon is excluded because it contains zeros
df2.loc[:, df2.all()]

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55


In [47]:
# Select the columns that have at least one nonzero value
# the whole DataFrame is returned because no column is all zeros

df2.loc[:, df2.any()]

month,eggs,salt,spam,bacon
Jan,47,12.0,17,0
Feb,110,50.0,31,0
Mar,221,89.0,72,50
Apr,77,87.0,20,60
May,132,,52,70
Jun,205,60.0,55,80
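The column selection works because `all()` and `any()` each reduce the DataFrame to one Boolean per column, which `.loc` then uses as a column mask. A sketch of the intermediate Series, using a hand-built table named `with_bacon` (an assumed stand-in for `df2`):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sales table plus the bacon column (stand-in for df2)
with_bacon = pd.DataFrame(
    {
        "eggs": [47, 110, 221, 77, 132, 205],
        "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        "spam": [17, 31, 72, 20, 52, 55],
        "bacon": [0, 0, 50, 60, 70, 80],
    },
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

# One Boolean per column: True if every value is nonzero
# (the NaN in salt does not count as zero, so salt stays True)
all_nonzero = with_bacon.all()

# One Boolean per column: True if at least one value is nonzero
any_nonzero = with_bacon.any()

print(all_nonzero.tolist())  # [True, True, True, False]
print(any_nonzero.tolist())  # [True, True, True, True]
```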


In [48]:
# Select the columns that contain at least one NaN

df.loc[:, df.isnull().any()]

month,salt
Jan,12.0
Feb,50.0
Mar,89.0
Apr,87.0
May,
Jun,60.0


In [49]:
# Select the columns with no NaN values

df.loc[:, df.notnull().all()]

month,eggs,spam
Jan,47,17
Feb,110,31
Mar,221,72
Apr,77,20
May,132,52
Jun,205,55


#### Drop rows with any NaNs

In [51]:
# notice that May is dropped because its salt value is NaN

df.dropna(how='any')

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
Jun,205,60.0,55


In [53]:
# notice that May is kept with how='all' because not every value in its row is NaN

df.dropna(how='all')

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55
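Because the sales table has no row that is entirely NaN, `how='all'` drops nothing there. The contrast is clearer on a small table that does contain such a row; the `demo` frame and its 'Jul' row below are hypothetical, added only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table: one complete row, one partial, one empty
demo = pd.DataFrame(
    {"eggs": [47.0, 132.0, np.nan], "salt": [12.0, np.nan, np.nan]},
    index=["Jan", "May", "Jul"],
)

# how='any' drops a row as soon as one value is missing
any_dropped = demo.dropna(how="any")

# how='all' drops a row only when every value is missing
all_dropped = demo.dropna(how="all")

print(any_dropped.index.tolist())  # ['Jan']
print(all_dropped.index.tolist())  # ['Jan', 'May']
```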


## Modifying a Column Based on Another

In [59]:
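# this adds 5 to the eggs column in the rows where salt > 55
# .loc avoids chained assignment, which may fail to modify df
# each re-run adds another 5; the +15 in the output below is consistent with three runs

df.loc[df.salt>55, 'eggs'] += 5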

In [58]:
df

month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,236,89.0,72
Apr,92,87.0,20
May,132,,52
Jun,220,60.0,55
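To see the effect of a single application in isolation, the update can be run once on a fresh copy. This is a sketch on a hand-built `sales` table (an assumed stand-in for the original `df`); the names `updated` and `mask` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the original sales table (assumed stand-in for the CSV)
sales = pd.DataFrame(
    {
        "eggs": [47, 110, 221, 77, 132, 205],
        "salt": [12.0, 50.0, 89.0, 87.0, np.nan, 60.0],
        "spam": [17, 31, 72, 20, 52, 55],
    },
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May", "Jun"], name="month"),
)

updated = sales.copy()            # work on a copy so sales stays unchanged
mask = updated["salt"] > 55       # True for Mar, Apr, Jun; False for May (NaN)

# Row mask plus column label in one .loc call: no chained assignment
updated.loc[mask, "eggs"] += 5

print(updated["eggs"].tolist())   # [47, 110, 226, 82, 132, 210]
```

Working on a copy also makes the cell safe to re-run, since each run starts again from the unmodified table.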
