### Slicing and Indexing DataFrames
Slicing selects a run of consecutive elements from an object, such as the rows or columns of a DataFrame.

In [50]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
   'col1': ['Item0', 'Item0', 'Item0', 'Item1', 'ItemA', 'ItemA', 'ItemC'],
   'col2': ['Gold', 'Bronze', 'Gold', 'Silver', 'Silver', 'Bronze', 'Gold',],
   'col3': [1, 2, 5, 4, 2, 6, 10],
   'col4': [0.1, 8, 35, 14, 8, 9, 0.321],
   'col5': ['moderate', 'easy', 'easy', 'hard', 'moderate', 'moderate', 'hard'],
   'col6': ['2015-10-23', '2016-06-23', '2019-10-13', '2019-10-23', '2011-02-16', '2019-01-14', '2021-01-23']
})

# Set the index and sort it first: label slicing with .loc needs a sorted index
df_sorted = df.set_index(['col1', 'col2']).sort_index()
df_sorted

col1,col2,col3,col4,col5,col6
Item0,Bronze,2,8.0,easy,2016-06-23
Item0,Gold,1,0.1,moderate,2015-10-23
Item0,Gold,5,35.0,easy,2019-10-13
Item1,Silver,4,14.0,hard,2019-10-23
ItemA,Bronze,6,9.0,moderate,2019-01-14
ItemA,Silver,2,8.0,moderate,2011-02-16
ItemC,Gold,10,0.321,hard,2021-01-23


#### 1. Slicing and subsetting with .loc and .iloc
- `df.loc['value1':'value5']` slices on the **outer index**. Unlike list slicing, the last specified value is included.
- To slice on the **inner index** as well, pass tuples for the first and last positions: `df.loc[('value1','value2'):('value3','value4')]`
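
The difference between list slicing and label slicing can be checked on a small example. This is a minimal sketch; the names `letters`, `s`, and `toy` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

letters = ['a', 'b', 'c', 'd']
s = pd.Series([10, 20, 30, 40], index=letters)

# List slicing works on positions and excludes the stop position.
list_slice = letters[1:3]

# .loc slicing works on labels and includes the stop label.
label_slice = s.loc['b':'d']

# Hypothetical two-level index, already sorted, to show tuple endpoints.
mi = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('y', 1), ('y', 2)], names=['outer', 'inner']
)
toy = pd.DataFrame({'val': [1, 2, 3, 4]}, index=mi)

# Both tuple endpoints are included in the result.
inner_slice = toy.loc[('x', 2):('y', 1)]

print(list_slice)
print(label_slice)
print(inner_slice)
```

The list slice keeps two elements, while the label slice over the same span keeps three.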

In [34]:
df_sorted.loc['Item1':'ItemA'] 

col1,col2,col3,col4,col5,col6
Item1,Silver,4,14.0,hard,2019-10-23
ItemA,Bronze,6,9.0,moderate,2019-01-14
ItemA,Silver,2,8.0,moderate,2011-02-16


In [35]:
df_sorted.loc[('Item0','Gold'):('ItemA', 'Bronze')] 

col1,col2,col3,col4,col5,col6
Item0,Gold,1,0.1,moderate,2015-10-23
Item0,Gold,5,35.0,easy,2019-10-13
Item1,Silver,4,14.0,hard,2019-10-23
ItemA,Bronze,6,9.0,moderate,2019-01-14


##### Slicing columns 
- `df.loc[:, 'column_from':'column_to']` slices columns by passing a second argument to `.loc[]` (a `:` by itself means "keep everything")
- slice rows (index) and columns at the same time with `df.loc[('Item0','Gold'):('ItemA', 'Bronze'), 'col3':'col4']`

In [36]:
df_sorted.loc[:, 'col3':'col4'] 

col1,col2,col3,col4
Item0,Bronze,2,8.0
Item0,Gold,1,0.1
Item0,Gold,5,35.0
Item1,Silver,4,14.0
ItemA,Bronze,6,9.0
ItemA,Silver,2,8.0
ItemC,Gold,10,0.321


In [37]:
df_sorted.loc[('Item0','Gold'):('ItemA', 'Bronze'), 'col3':'col4']

col1,col2,col3,col4
Item0,Gold,1,0.1
Item0,Gold,5,35.0
Item1,Silver,4,14.0
ItemA,Bronze,6,9.0


##### Subsetting by row and column number - `.iloc[]`
- `df.iloc[row_from:row_to, column_from:column_to]` works like list slicing: the last value specified is not included
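
To make the contrast with `.loc` concrete, here is a small sketch on a hypothetical frame `toy` with a default integer index. The same numbers `1:3` select different rows depending on the indexer.

```python
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12],
})

# Position-based: rows at positions 1 and 2, columns at positions 0 and 1.
by_position = toy.iloc[1:3, 0:2]

# Label-based: rows labelled 1 through 3, columns 'a' through 'b', ends included.
by_label = toy.loc[1:3, 'a':'b']

print(by_position)
print(by_label)
```

With the default integer index the labels happen to match the positions, so the only difference is whether the stop value is included.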

In [38]:
df

,col1,col2,col3,col4,col5,col6
0,Item0,Gold,1,0.1,moderate,2015-10-23
1,Item0,Bronze,2,8.0,easy,2016-06-23
2,Item0,Gold,5,35.0,easy,2019-10-13
3,Item1,Silver,4,14.0,hard,2019-10-23
4,ItemA,Silver,2,8.0,moderate,2011-02-16
5,ItemA,Bronze,6,9.0,moderate,2019-01-14
6,ItemC,Gold,10,0.321,hard,2021-01-23


In [39]:
df.iloc[2:6,2:] 

,col3,col4,col5,col6
2,5,35.0,easy,2019-10-13
3,4,14.0,hard,2019-10-23
4,2,8.0,moderate,2011-02-16
5,6,9.0,moderate,2019-01-14


##### Slicing dates/timeseries with `.loc[]`
- set the date as index, then sort it with `df.set_index('date').sort_index()`
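- slice by year only, e.g. `df.loc['2004':'2020']`. With a `DatetimeIndex` the end year is **inclusive**. Here `col6` holds plain strings, so the slice compares text: `'2019'` sorts before `'2019-01-14'`, and the 2019 rows are left out. Convert the column with `pd.to_datetime()` before setting the index to get date-aware slicing.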
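
Whether the end year is included depends on the dtype of the index. The sketch below rebuilds only `col3` and `col6` from the example data in a hypothetical frame `toy` and runs the same slice against a string index and a `DatetimeIndex`.

```python
import pandas as pd

dates = ['2015-10-23', '2016-06-23', '2019-10-13', '2019-10-23',
         '2011-02-16', '2019-01-14', '2021-01-23']
toy = pd.DataFrame({'col3': [1, 2, 5, 4, 2, 6, 10], 'col6': dates})

# String index: the slice bounds are compared as text.
as_strings = toy.set_index('col6').sort_index()
string_slice = as_strings.loc['2015':'2019']

# DatetimeIndex: '2019' stands for the whole year, so the end is inclusive.
toy['col6'] = pd.to_datetime(toy['col6'])
as_dates = toy.set_index('col6').sort_index()
date_slice = as_dates.loc['2015':'2019']

print(len(string_slice), len(date_slice))
```

The string-indexed slice stops before the 2019 rows, while the datetime-indexed slice keeps every row dated 2015 through the end of 2019.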

In [43]:
df_date_sorted = df.set_index('col6').sort_index()
df_date_sorted

col6,col1,col2,col3,col4,col5
2011-02-16,ItemA,Silver,2,8.0,moderate
2015-10-23,Item0,Gold,1,0.1,moderate
2016-06-23,Item0,Bronze,2,8.0,easy
2019-01-14,ItemA,Bronze,6,9.0,moderate
2019-10-13,Item0,Gold,5,35.0,easy
2019-10-23,Item1,Silver,4,14.0,hard
2021-01-23,ItemC,Gold,10,0.321,hard


In [47]:
df_date_sorted.loc['2015':'2019'] # string index: '2019' sorts before '2019-01-14', so the 2019 rows are excluded

col6,col1,col2,col3,col4,col5
2015-10-23,Item0,Gold,1,0.1,moderate
2016-06-23,Item0,Bronze,2,8.0,easy


In [51]:
df_date_sorted.loc['2015':'2019-02']

col6,col1,col2,col3,col4,col5
2015-10-23,Item0,Gold,1,0.1,moderate
2016-06-23,Item0,Bronze,2,8.0,easy
2019-01-14,ItemA,Bronze,6,9.0,moderate


#### 2. Pivot tables - just dataframes with sorted indexes
- `df.pivot_table('col3', index = 'col2', columns = 'col1')`
- first argument is the column name containing values to aggregate
- **index** argument: lists the columns to group by and display in rows
- **columns** argument: lists the columns to group by and display in columns
- the default aggregation function is **mean**; pass `aggfunc` to use a different one
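
The arguments listed above can be tried on a reduced copy of the example data. This is a sketch only: the frame `toy` keeps just `col1`, `col2`, and `col3`, and the second call swaps in `aggfunc='sum'` with `fill_value=0` as an illustration rather than anything the notebook itself uses.

```python
import pandas as pd

toy = pd.DataFrame({
    'col1': ['Item0', 'Item0', 'Item0', 'Item1', 'ItemA', 'ItemA', 'ItemC'],
    'col2': ['Gold', 'Bronze', 'Gold', 'Silver', 'Silver', 'Bronze', 'Gold'],
    'col3': [1, 2, 5, 4, 2, 6, 10],
})

# Default aggregation: mean of col3 for each (col2, col1) combination.
mean_pivot = toy.pivot_table('col3', index='col2', columns='col1')

# Same layout, but summing and filling empty combinations with 0.
sum_pivot = toy.pivot_table(
    'col3', index='col2', columns='col1', aggfunc='sum', fill_value=0
)

print(mean_pivot)
print(sum_pivot)
```

The two `Gold`/`Item0` rows (values 1 and 5) collapse into one cell: 3.0 under the mean and 6 under the sum. Combinations with no rows are `NaN` unless `fill_value` is given.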



In [60]:
df_pivot = df.pivot_table('col3', index = 'col6', columns = 'col1', fill_value = '-')
df_pivot

col1,Item0,Item1,ItemA,ItemC
col6,,,,
2011-02-16,-,-,2,-
2015-10-23,1,-,-,-
2016-06-23,2,-,-,-
2019-01-14,-,-,6,-
2019-10-13,5,-,-,-
2019-10-23,-,4,-,-
2021-01-23,-,-,-,10


In [62]:
df_pivot.loc['2016':'2020'] 

col1,Item0,Item1,ItemA,ItemC
col6,,,,
2016-06-23,2,-,-,-
2019-01-14,-,-,6,-
2019-10-13,5,-,-,-
2019-10-23,-,4,-,-


 - methods for calculating summary statistics on a dataframe, such as `mean`, have an `axis =` argument
 - the default value is `'index'`, which means "calculate the statistic across rows", giving one result per column
 - to calculate across columns instead, giving one result per row, set `axis = 'columns'`
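
A small all-numeric frame makes the two directions easy to verify by hand. The frame `toy` below is hypothetical and chosen so the means are round numbers.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis='index' (the default): collapse the rows, one mean per column.
per_column = toy.mean(axis='index')

# axis='columns': collapse the columns, one mean per row.
per_row = toy.mean(axis='columns')

print(per_column)
print(per_row)
```

`per_column` is indexed by the column names `a` and `b`, while `per_row` is indexed by the row labels 0 to 2.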

In [76]:
df

,col1,col2,col3,col4,col5,col6
0,Item0,Gold,1,0.1,moderate,2015-10-23
1,Item0,Bronze,2,8.0,easy,2016-06-23
2,Item0,Gold,5,35.0,easy,2019-10-13
3,Item1,Silver,4,14.0,hard,2019-10-23
4,ItemA,Silver,2,8.0,moderate,2011-02-16
5,ItemA,Bronze,6,9.0,moderate,2019-01-14
6,ItemC,Gold,10,0.321,hard,2021-01-23


In [74]:
df.mean(numeric_only = True) # skip the string columns; pandas 2.0+ raises a TypeError without this

col3     4.285714
col4    10.631571
dtype: float64

In [79]:
df.mean(axis = 'columns', numeric_only = True) # row-wise mean over the numeric columns, col3 and col4

0     0.5500
1     5.0000
2    20.0000
3     9.0000
4     5.0000
5     7.5000
6     5.1605
dtype: float64