# Filtering

In [1]:
import pandas as pd
testData = pd.read_csv("./data/startdata.csv",sep = ';', index_col = ['Date and time'], parse_dates = ['Date and time'], dayfirst = True)

## Using the index

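We can do some initial filtering on the index. If the index is a date, you can filter on it with `.loc`:
```python
DataFrame.loc['2018-01-02']
```
gives all rows for January 2nd, 2018. Plain `DataFrame['2018-01-02']` also selected rows in older pandas versions, but recent versions look for a column with that name instead.

In [13]:
testData.loc['2019-01-02']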

Date and time,Visibility,Value,Value2,Value3
2019-01-02 00:00:00,Show,26,1,E
2019-01-02 06:00:00,Show,38,63,F
2019-01-02 12:00:00,Show,54,33,A
2019-01-02 18:00:00,Show,89,11,B
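
Partial-string selection on a date index also works with slices, which is handy for picking a range of days. The following is a minimal sketch on a small synthetic frame; `demo` and its contents are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for testData: one reading every 6 hours for three days.
idx = pd.date_range("2019-01-01", periods=12, freq=pd.Timedelta(hours=6))
demo = pd.DataFrame({"Value": range(12)}, index=idx)

# All rows for a single day (partial-string match on the DatetimeIndex).
one_day = demo.loc["2019-01-02"]

# A range of days; with string slices both endpoints are included.
two_days = demo.loc["2019-01-02":"2019-01-03"]

print(len(one_day), len(two_days))
```

Using `.loc` makes it explicit that the selection is on rows, not on column names.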


## GroupBy

If a column contains repeated values, e.g. postal codes, the **groupby** function can be used to group the rows by those values. Once the rows are grouped, an aggregate function can be applied to each group:

In [5]:
testData.groupby('Value3').mean(numeric_only=True)  # skip the text column 'Visibility'

Value3,Value,Value2
A,38.0,43.0
B,69.625,41.625
C,59.25,34.75
D,55.875,58.625
E,48.125,39.0
F,34.125,44.875


To see how many rows were included per group, use the **.size()** function:

In [6]:
testData.groupby('Value3').size()

Value3
A    8
B    8
C    8
D    8
E    8
F    8
dtype: int64

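If we want to change the index of the grouped result (A, B, ..) to something more understandable, we can assign new labels to it. Note that this must be done on the grouped result, not on `testData` itself, because the number of labels has to match the number of groups:
```python
grouped = testData.groupby('Value3').size()
grouped.index = ["First", "Second", "Third", "Fourth", "Fifth", "Sixth"]
```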
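
Assigning a list to the index relies on the groups appearing in a known order. A mapping-based rename avoids that dependency; here is a small sketch under the assumption of a made-up frame `demo` (only the column name `Value3` is borrowed from the notebook):

```python
import pandas as pd

# Hypothetical data; the column name mirrors the notebook's 'Value3'.
demo = pd.DataFrame({"Value3": ["A", "B", "A", "C", "B", "A"],
                     "Value": [10, 20, 30, 40, 50, 60]})

counts = demo.groupby("Value3").size()

# Map each group label explicitly so the result does not depend on group order.
labels = {"A": "First", "B": "Second", "C": "Third"}
renamed = counts.rename(index=labels)

print(renamed)
```

`rename` returns a new Series and leaves `counts` unchanged.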

## Resample

The resample function groups the rows of the DataFrame into time buckets whose size is set by the specified interval, and then applies the specified operation to each bucket. The buckets are based on the index by default, or on a datetime column passed via the `on` parameter. Here we don't specify a column, so the index is used, and we resample to days (D):

In [3]:
testData.resample("D").sum()

Date and time,Value,Value2
2019-01-01,208,138
2019-01-02,207,108
2019-01-03,264,211
2019-01-04,252,266
2019-01-05,85,71
2019-01-06,200,195
2019-01-07,181,118
2019-01-08,288,164
2019-01-09,224,163
2019-01-10,192,241
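
When the timestamps live in a regular column rather than in the index, `resample` can be pointed at that column with `on=`. A minimal sketch, assuming a hypothetical frame `demo` with a datetime column named `when`:

```python
import pandas as pd

# Hypothetical data: timestamps are stored in a column, not in the index.
demo = pd.DataFrame({
    "when": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 12:00",
                            "2019-01-02 06:00", "2019-01-02 18:00"]),
    "Value": [1, 2, 3, 4],
})

# Bucket by calendar day using the 'when' column, then sum 'Value' per bucket.
daily = demo.resample("D", on="when")["Value"].sum()

print(daily)
```

Selecting `["Value"]` before aggregating restricts the sum to that one column.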


## Sorting

DataFrames also have a sorting function, allowing us to choose which columns to sort on and whether to sort ascending or descending:
```python
.sort_values(['Column', 'Column2', ...], ascending=False)
```

In [10]:
testData.sort_values(['Value'], ascending=False).head(5)

Date and time,Visibility,Value,Value2,Value3
2019-01-03 12:00:00,Show,98,65,E
2019-01-04 12:00:00,Hide,95,39,C
2019-01-09 12:00:00,Show,93,14,E
2019-01-08 18:00:00,Hide,91,61,B
2019-01-03 00:00:00,Show,90,12,C
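
The `ascending` argument also accepts a list, one entry per sort column, so each column can have its own direction. A short sketch on made-up data (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical data reusing the notebook's column names.
demo = pd.DataFrame({"Value3": ["B", "A", "B", "A"],
                     "Value": [5, 7, 9, 3]})

# Sort by Value3 ascending, then by Value descending within each Value3.
ordered = demo.sort_values(["Value3", "Value"], ascending=[True, False])

print(ordered)
```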


## .agg()


.agg is called on the result of .groupby. Unlike a single aggregate such as .mean(), it can apply different operations to different columns and several operations to the same column, generating multiple results per column.

In [22]:
testData.groupby('Value3').agg({'Value3':'count'})

Value3,Value3
A,8
B,8
C,8
D,8
E,8
F,8


In [24]:
testData.groupby('Value3').agg({'Value':['min', 'max'], 'Value2' : 'mean'})

,Value,Value,Value2
Value3,min,max,mean
A,1,86,43.0
B,32,91,41.625
C,18,95,34.75
D,30,74,58.625
E,1,98,39.0
F,4,89,44.875
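
Passing a dict with lists produces two-level column headers like the ones above. If flat column names are preferred, pandas also supports named aggregation, where each keyword becomes an output column. A minimal sketch on hypothetical data (the frame `demo` and the output names `value_min`, `value_max` and `value2_mean` are chosen for illustration):

```python
import pandas as pd

# Hypothetical data reusing the notebook's column names.
demo = pd.DataFrame({"Value3": ["A", "A", "B", "B"],
                     "Value": [1, 86, 32, 91],
                     "Value2": [40, 46, 41, 43]})

# Named aggregation: output_name=(input_column, operation).
summary = demo.groupby("Value3").agg(
    value_min=("Value", "min"),
    value_max=("Value", "max"),
    value2_mean=("Value2", "mean"),
)

print(summary)
```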


Next: [Row operations](06-Row_operations.ipynb)