# Manipulating DataFrames with Python | Part 1 (Slicing, Filtering and Indexing)

Source code for the Medium article [Manipulating DataFrames with Python | Part 1 (Slicing, Filtering and Indexing)](https://bit.ly/2PztdVY), written by [Ankita Prasad](https://ankita2108prasad.medium.com/).

## Importing libraries

In [1]:
import pandas as pd

## Read dataset

In [2]:
df = pd.read_pickle('./hs_aws_costos')

In [3]:
df

,UsageType,Service,Cost,CostUnit,PeriodStart,PeriodEnd,Total,Estimated,CreatedAt
0,USE1-CostDataStorage,AWS Cost Explorer,0.015232,USD,2020-12-01,2020-12-02,{},False,2021-03-24 05:11:08.261684
1,USE1-CostDataStorage,AWS Cost Explorer,0.015740,USD,2020-12-02,2020-12-03,{},False,2021-03-24 05:11:08.261684
2,USE1-CostDataStorage,AWS Cost Explorer,0.015848,USD,2020-12-03,2020-12-04,{},False,2021-03-24 05:11:08.261684
3,USE1-CostDataStorage,AWS Cost Explorer,0.016307,USD,2020-12-04,2020-12-05,{},False,2021-03-24 05:11:08.261684
4,USE1-CostDataStorage,AWS Cost Explorer,0.016076,USD,2020-12-05,2020-12-06,{},False,2021-03-24 05:11:08.261684
...,...,...,...,...,...,...,...,...,...
19580,USW2-TimedStorage-ByteHrs,Tax,0.000000,USD,2021-03-01,2021-03-02,{},True,2021-03-24 05:11:13.860169
19581,USW2-USE1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-01,2021-03-02,{},True,2021-03-24 05:11:13.860169
19582,USW2-USW1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-01,2021-03-02,{},True,2021-03-24 05:11:13.860169
19583,USW2-UnusedStaticIP,Tax,0.000000,USD,2021-03-01,2021-03-02,{},True,2021-03-24 05:11:13.860169


## Slicing DataFrames

In [4]:
# Select one column, then slice its rows by position:
# df[col_name][x:y] returns rows x through y-1 of that column as a Series.
#
df['Service'][0:4]

0    AWS Cost Explorer
1    AWS Cost Explorer
2    AWS Cost Explorer
3    AWS Cost Explorer
Name: Service, dtype: object
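
The slice above is positional, so its stop value is excluded. A minimal sketch of the same idea, assuming a hypothetical toy frame (`toy` and its values are made up for illustration and are not the cost dataset), compares it with the label-based equivalent:

```python
import pandas as pd

# Hypothetical toy frame standing in for the cost dataset.
toy = pd.DataFrame({
    "Service": ["A", "B", "C", "D", "E"],
    "Cost": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Positional slice: positions 0, 1, 2, 3 (the stop position 4 is excluded).
by_position = toy["Service"].iloc[0:4]

# Label slice on the default integer index: the stop label 3 is included.
by_label = toy.loc[0:3, "Service"]

print(by_position.tolist())  # ['A', 'B', 'C', 'D']
print(by_label.tolist())     # ['A', 'B', 'C', 'D']
```

Spelling out `.iloc` or `.loc` makes it explicit whether the numbers are positions or labels.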

In [5]:
df1 = df.set_index('PeriodStart')
df1

PeriodStart,UsageType,Service,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-01,USE1-CostDataStorage,AWS Cost Explorer,0.015232,USD,2020-12-02,{},False,2021-03-24 05:11:08.261684
2020-12-02,USE1-CostDataStorage,AWS Cost Explorer,0.015740,USD,2020-12-03,{},False,2021-03-24 05:11:08.261684
2020-12-03,USE1-CostDataStorage,AWS Cost Explorer,0.015848,USD,2020-12-04,{},False,2021-03-24 05:11:08.261684
2020-12-04,USE1-CostDataStorage,AWS Cost Explorer,0.016307,USD,2020-12-05,{},False,2021-03-24 05:11:08.261684
2020-12-05,USE1-CostDataStorage,AWS Cost Explorer,0.016076,USD,2020-12-06,{},False,2021-03-24 05:11:08.261684
...,...,...,...,...,...,...,...,...
2021-03-01,USW2-TimedStorage-ByteHrs,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,USW2-USE1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,USW2-USW1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,USW2-UnusedStaticIP,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169


In [6]:
# Slicing using loc -> here the row indexes are labels.
# Returns a DataFrame with the rows from the start label to the end label
# and the columns from col1 to coln; unlike positional slices, both
# endpoints are included.
#
df1.loc['2020-12-01':'2020-12-02', 'Service': 'Cost']

PeriodStart,Service,Cost
2020-12-01,AWS Cost Explorer,0.015232
2020-12-02,AWS Cost Explorer,0.015740
2020-12-01,AWS Global Accelerator,0.600000
2020-12-02,AWS Global Accelerator,0.600000
2020-12-01,AWS Key Management Service,0.000000
...,...,...
2020-12-01,Tax,0.000000
2020-12-01,Tax,0.000000
2020-12-01,Tax,0.000000
2020-12-01,Tax,0.000000
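
To see that `loc` slices include both endpoints, here is a small sketch on a hypothetical frame. The index values, column values and the name `toy` are assumptions for illustration; the index is kept sorted so that label slicing with repeated labels is well defined.

```python
import pandas as pd

# Hypothetical toy frame with repeated, sorted date labels as the index.
toy = pd.DataFrame(
    {
        "UsageType": ["u1", "u2", "u3", "u4"],
        "Service": ["A", "B", "A", "B"],
        "Cost": [0.1, 0.6, 0.2, 0.6],
        "CostUnit": ["USD"] * 4,
    },
    index=pd.Index(
        ["2020-12-01", "2020-12-01", "2020-12-02", "2020-12-03"],
        name="PeriodStart",
    ),
)

# Rows labelled 2020-12-01 through 2020-12-02 (inclusive),
# columns Service through Cost (inclusive).
subset = toy.loc["2020-12-01":"2020-12-02", "Service":"Cost"]

print(subset.shape)           # (3, 2)
print(list(subset.columns))   # ['Service', 'Cost']
```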


In [7]:
# Returns the value at a row label / column pair. PeriodStart labels repeat
# in this dataset, so the result here is a Series of every matching row.
#
df1.loc['2020-12-24', 'Cost']

PeriodStart
2020-12-24    0.015915
2020-12-24    0.600000
2020-12-24    0.006453
2020-12-24    0.000000
2020-12-24    0.000980
                ...   
2020-12-24    0.000000
2020-12-24    0.000000
2020-12-24    0.532258
2020-12-24    0.000000
2020-12-24    0.000000
Name: Cost, Length: 178, dtype: float32
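
Whether `loc[row, col]` returns a scalar or a Series depends on how many rows carry the label. A minimal sketch, assuming a hypothetical three-row frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame: one label is unique, one is repeated.
toy = pd.DataFrame(
    {"Cost": [0.1, 0.6, 0.2]},
    index=pd.Index(["2020-12-23", "2020-12-24", "2020-12-24"], name="PeriodStart"),
)

single = toy.loc["2020-12-23", "Cost"]    # unique label -> scalar value
repeated = toy.loc["2020-12-24", "Cost"]  # repeated label -> Series

print(type(repeated).__name__)  # Series
print(len(repeated))            # 2
```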

In [8]:
# Selecting using iloc -> here the row indexes are integer positions.
# A list such as [0, 9] returns exactly those rows (positions 0 and 9);
# a slice such as 0:9 would return positions 0 through 8.
#
df1.iloc[[0, 9]]

PeriodStart,UsageType,Service,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-01,USE1-CostDataStorage,AWS Cost Explorer,0.015232,USD,2020-12-02,{},False,2021-03-24 05:11:08.261684
2020-12-10,USE1-CostDataStorage,AWS Cost Explorer,0.016213,USD,2020-12-11,{},False,2021-03-24 05:11:08.261684


In [9]:
# Returns the DataFrame made of the listed row positions (0 and 9) and
# the listed column positions (1 and 2). Use slices (0:9, 1:3) for ranges,
# which exclude the end position.
#
df1.iloc[[0, 9], [1, 2]]

PeriodStart,Service,Cost
2020-12-01,AWS Cost Explorer,0.015232
2020-12-10,AWS Cost Explorer,0.016213


In [10]:
# Returns the value from the row col pair in dataframe
#
df1.iloc[0, 0]

'USE1-CostDataStorage'
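
The difference between passing a list and passing a slice to `iloc` is easy to miss. The sketch below, on a hypothetical numeric frame (`toy` is an assumption for illustration), shows both forms:

```python
import pandas as pd

# Hypothetical toy frame with 10 rows and 3 columns.
toy = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

picked = toy.iloc[[0, 9]]    # a list: exactly positions 0 and 9
ranged = toy.iloc[0:9]       # a slice: positions 0 through 8
block = toy.iloc[0:2, 1:3]   # rows 0-1, columns at positions 1-2

print(picked.index.tolist())  # [0, 9]
print(len(ranged))            # 9
print(list(block.columns))    # ['b', 'c']
```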

## Filtering DataFrames

In [11]:
# Creating a boolean Series.
# df1.Cost > 60 is the filter that selects only the rows whose Cost is
# greater than 60. Filters can be combined using the | (or), & (and) and
# ~ (not) operators, with each condition wrapped in parentheses.
#
df1.loc[df1.Cost > 60]

PeriodStart,UsageType,Service,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-01,USW2-HeavyUsage:r5a.xlarge,Amazon Elastic Compute Cloud - Compute,727.632019,USD,2020-12-02,{},False,2021-03-24 05:11:11.372433
2021-01-01,USW2-HeavyUsage:r5a.xlarge,Amazon Elastic Compute Cloud - Compute,727.632019,USD,2021-01-02,{},False,2021-03-24 05:11:11.372433
2021-02-01,USW2-HeavyUsage:r5a.xlarge,Amazon Elastic Compute Cloud - Compute,657.216003,USD,2021-02-02,{},False,2021-03-24 05:11:11.372433
2021-03-01,USW2-HeavyUsage:r5a.xlarge,Amazon Elastic Compute Cloud - Compute,727.632019,USD,2021-03-02,{},True,2021-03-24 05:11:11.372433
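
Combining filters can be sketched as follows. The frame `toy` and its values are hypothetical; the column names mirror the dataset only for readability.

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "Service": ["EC2", "Tax", "EC2", "S3"],
    "Cost": [727.6, 0.0, 12.5, 75.0],
    "Estimated": [False, True, True, False],
})

# AND plus NOT: costly rows that are not estimates.
expensive_final = toy.loc[(toy["Cost"] > 60) & ~toy["Estimated"]]

# OR: rows that are tax entries or cost nothing.
tax_or_free = toy.loc[(toy["Service"] == "Tax") | (toy["Cost"] == 0)]

print(expensive_final.index.tolist())  # [0, 3]
print(tax_or_free.index.tolist())      # [1]
```

The parentheses matter because `&`, `|` and `~` bind more tightly than comparison operators in Python.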


In [12]:
# Selecting columns in which every value is truthy (non-zero, non-empty, not False)
#
df1.loc[:, df1.all()]

PeriodStart,UsageType,Service,CostUnit,PeriodEnd,CreatedAt
2020-12-01,USE1-CostDataStorage,AWS Cost Explorer,USD,2020-12-02,2021-03-24 05:11:08.261684
2020-12-02,USE1-CostDataStorage,AWS Cost Explorer,USD,2020-12-03,2021-03-24 05:11:08.261684
2020-12-03,USE1-CostDataStorage,AWS Cost Explorer,USD,2020-12-04,2021-03-24 05:11:08.261684
2020-12-04,USE1-CostDataStorage,AWS Cost Explorer,USD,2020-12-05,2021-03-24 05:11:08.261684
2020-12-05,USE1-CostDataStorage,AWS Cost Explorer,USD,2020-12-06,2021-03-24 05:11:08.261684
...,...,...,...,...,...
2021-03-01,USW2-TimedStorage-ByteHrs,Tax,USD,2021-03-02,2021-03-24 05:11:13.860169
2021-03-01,USW2-USE1-AWS-Out-Bytes,Tax,USD,2021-03-02,2021-03-24 05:11:13.860169
2021-03-01,USW2-USW1-AWS-Out-Bytes,Tax,USD,2021-03-02,2021-03-24 05:11:13.860169
2021-03-01,USW2-UnusedStaticIP,Tax,USD,2021-03-02,2021-03-24 05:11:13.860169


In [13]:
# Selecting columns in which at least one value is truthy (non-zero, non-empty, not False)
#
df1.loc[:, df1.any()]

PeriodStart,UsageType,Service,Cost,CostUnit,PeriodEnd,Estimated,CreatedAt
2020-12-01,USE1-CostDataStorage,AWS Cost Explorer,0.015232,USD,2020-12-02,False,2021-03-24 05:11:08.261684
2020-12-02,USE1-CostDataStorage,AWS Cost Explorer,0.015740,USD,2020-12-03,False,2021-03-24 05:11:08.261684
2020-12-03,USE1-CostDataStorage,AWS Cost Explorer,0.015848,USD,2020-12-04,False,2021-03-24 05:11:08.261684
2020-12-04,USE1-CostDataStorage,AWS Cost Explorer,0.016307,USD,2020-12-05,False,2021-03-24 05:11:08.261684
2020-12-05,USE1-CostDataStorage,AWS Cost Explorer,0.016076,USD,2020-12-06,False,2021-03-24 05:11:08.261684
...,...,...,...,...,...,...,...
2021-03-01,USW2-TimedStorage-ByteHrs,Tax,0.000000,USD,2021-03-02,True,2021-03-24 05:11:13.860169
2021-03-01,USW2-USE1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-02,True,2021-03-24 05:11:13.860169
2021-03-01,USW2-USW1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-02,True,2021-03-24 05:11:13.860169
2021-03-01,USW2-UnusedStaticIP,Tax,0.000000,USD,2021-03-02,True,2021-03-24 05:11:13.860169
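
`all()` and `any()` reduce each column to a single boolean, which `loc` then uses as a column mask. A minimal sketch on a hypothetical numeric frame (the column names are made up to describe their contents):

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "all_nonzero": [1, 2, 3],
    "some_zero": [0, 5, 0],
    "all_zero": [0, 0, 0],
})

every_value_nonzero = toy.loc[:, toy.all()]
some_value_nonzero = toy.loc[:, toy.any()]

print(list(every_value_nonzero.columns))  # ['all_nonzero']
print(list(some_value_nonzero.columns))   # ['all_nonzero', 'some_zero']
```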


In [14]:
# Selecting columns with any NaNs.
#
df1.loc[:, df1.isnull().any()]

2020-12-01
2020-12-02
2020-12-03
2020-12-04
2020-12-05
...
2021-03-01
2021-03-01
2021-03-01
2021-03-01
2021-03-01


In [15]:
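# Selecting columns in which every value is NaN (none here, so no columns
# are returned). To select columns with no NaN values, use
# df1.loc[:, df1.notnull().all()] instead.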
#
df1.loc[:, df1.isnull().all()]

2020-12-01
2020-12-02
2020-12-03
2020-12-04
2020-12-05
...
2021-03-01
2021-03-01
2021-03-01
2021-03-01
2021-03-01
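
Because the cost dataset contains no NaN values, both results above are empty. The sketch below uses a hypothetical frame with missing values (`toy` and its column names are assumptions) to show what each mask selects:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values.
toy = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

with_any_nan = toy.loc[:, toy.isnull().any()]   # at least one NaN
all_nan = toy.loc[:, toy.isnull().all()]        # nothing but NaN
no_nan = toy.loc[:, toy.notnull().all()]        # no NaN at all

print(list(with_any_nan.columns))  # ['partial', 'empty']
print(list(all_nan.columns))       # ['empty']
print(list(no_nan.columns))        # ['complete']
```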


In [16]:
# Dropping rows with any NaNs.
#
df2 = df1.dropna(how = 'any')
df2

PeriodStart,UsageType,Service,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-01,USE1-CostDataStorage,AWS Cost Explorer,0.015232,USD,2020-12-02,{},False,2021-03-24 05:11:08.261684
2020-12-02,USE1-CostDataStorage,AWS Cost Explorer,0.015740,USD,2020-12-03,{},False,2021-03-24 05:11:08.261684
2020-12-03,USE1-CostDataStorage,AWS Cost Explorer,0.015848,USD,2020-12-04,{},False,2021-03-24 05:11:08.261684
2020-12-04,USE1-CostDataStorage,AWS Cost Explorer,0.016307,USD,2020-12-05,{},False,2021-03-24 05:11:08.261684
2020-12-05,USE1-CostDataStorage,AWS Cost Explorer,0.016076,USD,2020-12-06,{},False,2021-03-24 05:11:08.261684
...,...,...,...,...,...,...,...,...
2021-03-01,USW2-TimedStorage-ByteHrs,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,USW2-USE1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,USW2-USW1-AWS-Out-Bytes,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,USW2-UnusedStaticIP,Tax,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
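
Since the dataset has no missing values, `dropna` leaves it unchanged here. A small sketch on a hypothetical frame (`toy` is illustrative only) shows how `how='any'` differs from `how='all'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing.
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [2.0, 5.0, np.nan],
})

drop_any = toy.dropna(how="any")  # drop rows with at least one NaN
drop_all = toy.dropna(how="all")  # drop rows only if every value is NaN

print(drop_any.index.tolist())  # [0]
print(drop_all.index.tolist())  # [0, 1]
```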


In [17]:
# Filtering a column based on another.
#
# df.eggs[df.salt > 55]

In [18]:
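# Modifying a column based on another. Assign through .loc so that the
# change is applied to df itself rather than to a temporary copy.
#
# df.loc[df.salt > 55, 'eggs'] += 5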
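
The two commented-out cells refer to `eggs` and `salt` columns that are not part of the cost dataset. A runnable sketch, assuming a hypothetical frame with those two columns (the values are made up), selects and updates through `.loc` in a single indexing step:

```python
import pandas as pd

# Hypothetical toy frame with the column names used in the comments above.
toy = pd.DataFrame({"eggs": [47, 110, 221], "salt": [12.0, 50.0, 89.0]})

# Filter one column based on another.
high_salt_eggs = toy.loc[toy["salt"] > 55, "eggs"]
print(high_salt_eggs.tolist())  # [221]

# Modify one column based on another, in place on `toy`.
toy.loc[toy["salt"] > 55, "eggs"] += 5
print(toy["eggs"].tolist())     # [47, 110, 226]
```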

In [19]:
# Convert to dozens (units of 12) using floor division.
#
# df.floordiv(12)
# df.apply(lambda x: x//12)
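
Both commented-out forms give the same result on an all-numeric frame. A minimal sketch, assuming a hypothetical frame of counts (`toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical toy frame of item counts.
toy = pd.DataFrame({"eggs": [47, 110, 221], "spam": [17, 31, 72]})

dozens = toy.floordiv(12)                          # vectorised method
dozens_apply = toy.apply(lambda col: col // 12)    # column-by-column lambda

print(dozens["eggs"].tolist())  # [3, 9, 18]
print(dozens["spam"].tolist())  # [1, 2, 6]
```

`floordiv` is usually the more direct choice; `apply` is shown only because the comment above mentions it.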

## Index Objects and Labeled Data

### 1. Creating a Series vs. Creating an Index

In [20]:
prices = [10.70, 10.86, 10.74, 8.48]
shares = pd.Series(prices)
print(shares)

0    10.70
1    10.86
2    10.74
3     8.48
dtype: float64


In [21]:
days = ['Mon', 'Tue', 'Wed', 'Thur']
prices = [10.70, 10.86, 10.74, 8.48]

shares = pd.Series(prices, index=days)

print(shares)

Mon     10.70
Tue     10.86
Wed     10.74
Thur     8.48
dtype: float64


### 2. Modifying Index Values

In [23]:
# Deliberately introduced error.
#
shares.index[2] = 'Wednesday'

TypeError: Index does not support mutable operations

In [25]:
# Right way -> overriding all at once.

shares.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
print(shares)

Monday       10.70
Tuesday      10.86
Wednesday    10.74
Thursday      8.48
dtype: float64
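
When only one label needs to change, replacing the whole index is not the only option. The sketch below rebuilds the `shares` Series from this section and uses `Series.rename` with a mapping; choosing `rename` here is a suggestion, not something the article itself covers.

```python
import pandas as pd

shares = pd.Series([10.70, 10.86, 10.74, 8.48], index=["Mon", "Tue", "Wed", "Thur"])

# Item assignment on an Index raises TypeError because Index is immutable.
error_message = None
try:
    shares.index[2] = "Wednesday"
except TypeError as exc:
    error_message = str(exc)

# rename returns a new Series with only the mapped labels changed.
renamed = shares.rename(index={"Wed": "Wednesday"})

print(renamed.index.tolist())  # ['Mon', 'Tue', 'Wednesday', 'Thur']
print(shares.index.tolist())   # ['Mon', 'Tue', 'Wed', 'Thur']
```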


In [None]:
# Assigning index values
#
# unemployment.index = unemployment['zip']  # where zip is a column in unemployment
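
The `unemployment` frame is not defined in this notebook, so the sketch below builds a hypothetical one (column names and values are assumptions) to show what the assignment does and how it differs from `set_index`:

```python
import pandas as pd

# Hypothetical stand-in for the `unemployment` frame mentioned above.
raw = pd.DataFrame({"zip": ["1001", "1002"], "rate": [0.06, 0.09]})

# Assigning a column as the index keeps the column in the frame.
unemployment = raw.copy()
unemployment.index = unemployment["zip"]

# set_index moves the column into the index (drop=True by default).
indexed = raw.set_index("zip")

print(list(unemployment.columns))  # ['zip', 'rate']
print(list(indexed.columns))       # ['rate']
```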

### 3. Hierarchical Indexing (Multi-indexing)

In [43]:
df2 = df.set_index(['PeriodStart', 'Service'])
df2

PeriodStart,Service,UsageType,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-01,AWS Cost Explorer,USE1-CostDataStorage,0.015232,USD,2020-12-02,{},False,2021-03-24 05:11:08.261684
2020-12-02,AWS Cost Explorer,USE1-CostDataStorage,0.015740,USD,2020-12-03,{},False,2021-03-24 05:11:08.261684
2020-12-03,AWS Cost Explorer,USE1-CostDataStorage,0.015848,USD,2020-12-04,{},False,2021-03-24 05:11:08.261684
2020-12-04,AWS Cost Explorer,USE1-CostDataStorage,0.016307,USD,2020-12-05,{},False,2021-03-24 05:11:08.261684
2020-12-05,AWS Cost Explorer,USE1-CostDataStorage,0.016076,USD,2020-12-06,{},False,2021-03-24 05:11:08.261684
...,...,...,...,...,...,...,...,...
2021-03-01,Tax,USW2-TimedStorage-ByteHrs,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,Tax,USW2-USE1-AWS-Out-Bytes,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,Tax,USW2-USW1-AWS-Out-Bytes,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169
2021-03-01,Tax,USW2-UnusedStaticIP,0.000000,USD,2021-03-02,{},True,2021-03-24 05:11:13.860169


### 4. Fancy Indexing

#### Outermost Indexing

In [45]:
df2.loc[(['2020-12-24', '2020-12-31'], 'AWS Global Accelerator'), :]

PeriodStart,Service,UsageType,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-24,AWS Global Accelerator,Global-Accelerator-fixed-fee,0.6,USD,2020-12-25,{},False,2021-03-24 05:11:08.477572
2020-12-31,AWS Global Accelerator,Global-Accelerator-fixed-fee,0.5,USD,2021-01-01,{},False,2021-03-24 05:11:08.477572


#### Innermost Indexing

In [50]:
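# Sort the index first: selecting a list of labels on an inner level
# works best on a sorted MultiIndex.
df2.sort_index(inplace=True)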
df2.loc[('2020-12-31', ['AWS Global Accelerator', 'AWS Cost Explorer']), 'Cost']

PeriodStart  Service               
2020-12-31   AWS Cost Explorer         0.016524
             AWS Global Accelerator    0.500000
Name: Cost, dtype: float32
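
The tuple passed to `loc` holds one selector per index level. A minimal sketch on a hypothetical two-level frame (`toy`, the service names and the costs are made up) shows a list on the outer level, a list on the inner level, and `slice(None)` to take every value of a level:

```python
import pandas as pd

# Hypothetical toy frame indexed by (PeriodStart, Service), sorted.
toy = pd.DataFrame({
    "PeriodStart": ["2020-12-24", "2020-12-24", "2020-12-31", "2020-12-31"],
    "Service": ["Accel", "Explorer", "Accel", "Explorer"],
    "Cost": [0.6, 0.016, 0.5, 0.017],
}).set_index(["PeriodStart", "Service"]).sort_index()

# Outer level: several dates, one service.
outer = toy.loc[(["2020-12-24", "2020-12-31"], "Accel"), :]

# Inner level: one date, several services, one column.
inner = toy.loc[("2020-12-31", ["Accel", "Explorer"]), "Cost"]

# Every date, one service.
one_service = toy.loc[(slice(None), "Explorer"), :]

print(outer["Cost"].tolist())  # [0.6, 0.5]
print(inner.tolist())          # [0.5, 0.017]
print(len(one_service))        # 2
```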

### 5. Sorting Indexes

In [52]:
df2 = df2.sort_index()
df2

PeriodStart,Service,UsageType,Cost,CostUnit,PeriodEnd,Total,Estimated,CreatedAt
2020-12-01,AWS Cost Explorer,USE1-CostDataStorage,1.523226e-02,USD,2020-12-02,{},False,2021-03-24 05:11:08.261684
2020-12-01,AWS Global Accelerator,Global-Accelerator-fixed-fee,6.000000e-01,USD,2020-12-02,{},False,2021-03-24 05:11:08.477572
2020-12-01,AWS Key Management Service,us-west-1-KMS-Requests,0.000000e+00,USD,2020-12-02,{},False,2021-03-24 05:11:08.683079
2020-12-01,AWS Lambda,USW1-DataTransfer-In-Bytes,0.000000e+00,USD,2020-12-02,{},False,2021-03-24 05:11:08.959199
2020-12-01,AWS Lambda,USW1-DataTransfer-Out-Bytes,0.000000e+00,USD,2020-12-02,{},False,2021-03-24 05:11:08.959199
...,...,...,...,...,...,...,...,...
2021-03-21,EC2 - Other,USW2-USE1-AWS-Out-Bytes,1.258390e-05,USD,2021-03-22,{},True,2021-03-24 05:11:10.048152
2021-03-21,EC2 - Other,USW2-USE2-AWS-In-Bytes,0.000000e+00,USD,2021-03-22,{},True,2021-03-24 05:11:10.048152
2021-03-21,EC2 - Other,USW2-USE2-AWS-Out-Bytes,1.738000e-07,USD,2021-03-22,{},True,2021-03-24 05:11:10.048152
2021-03-21,EC2 - Other,USW2-USW1-AWS-In-Bytes,0.000000e+00,USD,2021-03-22,{},True,2021-03-24 05:11:10.048152
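
`sort_index` orders a MultiIndex level by level, and `is_monotonic_increasing` can confirm the result. A short sketch on a hypothetical unsorted frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with an unsorted two-level index.
toy = pd.DataFrame(
    {"Cost": [3.0, 1.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [("2020-12-02", "B"), ("2020-12-01", "B"), ("2020-12-01", "A")],
        names=["PeriodStart", "Service"],
    ),
)

sorted_toy = toy.sort_index()                 # ascending on both levels
descending = toy.sort_index(ascending=False)  # descending on both levels

print(toy.index.is_monotonic_increasing)         # False
print(sorted_toy.index.is_monotonic_increasing)  # True
print(sorted_toy["Cost"].tolist())               # [2.0, 1.0, 3.0]
```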
