## Row operations

In [7]:
import pandas as pd
testData = pd.read_csv("./data/startdata.csv",sep = ';', index_col = ['Date and time'], parse_dates = ['Date and time'], dayfirst = True)

To get **one or more rows** from the DataFrame as plain arrays, use the **values** property with a position or a slice:

In [10]:
testData.values[3]

array(['Show', 61, 8, 'D'], dtype=object)

In [13]:
testData.values[3:5]

array([['Show', 61, 8, 'D'],
       ['Show', 26, 1, 'E']], dtype=object)
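Note that **values** returns bare NumPy arrays, so the index and column labels are lost. If the labels matter, positional selection with **.iloc** keeps them. A minimal sketch, where `sample` is a small hand-built stand-in for the CSV data (an assumption, not the real file):

```python
import pandas as pd

# Hypothetical stand-in for the first rows of startdata.csv
sample = pd.DataFrame(
    {
        "Visibility": ["Show", "Hide", "Show", "Show", "Show"],
        "Value": [30, 73, 44, 61, 26],
        "Value2": [42, 37, 51, 8, 1],
        "Value3": ["A", "B", "C", "D", "E"],
    },
    index=pd.date_range("2019-01-01", periods=5, freq="6h", name="Date and time"),
)

single_row = sample.iloc[3]    # one row, returned as a Series with column labels
row_slice = sample.iloc[3:5]   # rows at positions 3 and 4, returned as a DataFrame
```
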

### Filtering rows based on a condition

You can apply a function to all the fields in a column at once and then do further processing with the results. Below we specify testData['Visibility'], which tells pandas to act on all the fields in that column. Here we check whether the 'Visibility' column contains 'Hide'. The **str.contains()** function returns a boolean per row.

In [2]:
rowsToHide = testData['Visibility'].str.contains("Hide").fillna(False)
rowsToHide.head(5)

pandas.core.series.Series

As contains() returns a boolean per row, we get a Series of True/False values with as many entries as the DataFrame has rows. Because the column could hold missing or non-string values, for which contains() returns NaN instead of a boolean, you can add **.fillna(False)** after the closing bracket of contains() to set those to False in the returned Series, or to any other acceptable value, depending on what type the function returns.
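The effect of the missing-value handling can be sketched on a small Series. The `visibility` Series below is made up for illustration, and `dtype=object` is set explicitly because other string dtypes may treat missing values differently:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
visibility = pd.Series(["Show", "Hide", np.nan, "Hide"], dtype=object, name="Visibility")

raw = visibility.str.contains("Hide")                    # missing entry stays NaN
filled = visibility.str.contains("Hide").fillna(False)   # missing entry becomes False
direct = visibility.str.contains("Hide", na=False)       # same result via the na argument
```

The `na` argument of str.contains() does the same job as the trailing .fillna(False) in a single step.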

In a case like this, where a Series of booleans is returned, you can invert it with ~(name):
```python
rowsToShow = ~(rowsToHide)
```

These Series can be used with a DataFrame to show or use only the rows that are True, by adding the Series in a separate set of square brackets. Note that we do not use quotes inside the square brackets here, as we are passing a variable rather than a column name.

In [93]:
testData[rowsToHide]

Date and time,Visibility,Value,Value2,Value3
2019-01-01 06:00:00,Hide,73,37,B
2019-01-04 12:00:00,Hide,95,39,C
2019-01-06 12:00:00,Hide,83,95,E
2019-01-08 18:00:00,Hide,91,61,B
2019-01-11 00:00:00,Hide,1,74,E


You can also combine them with a selection of one or more columns:

In [6]:
testData[['Value', 'Value2']][rowsToHide]

Date and time,Value,Value2
2019-01-01 06:00:00,73,37
2019-01-04 12:00:00,95,39
2019-01-06 12:00:00,83,95
2019-01-08 18:00:00,91,61
2019-01-11 00:00:00,1,74


To understand the example above, it is best to split it up into parts:

1) *testData[['Value', 'Value2']]* : This actually returns a new DataFrame. On this new dataframe, the next operation is performed.

2) *[rowsToHide]* : The DataFrame returned by the first part is then filtered with the Series of True/False values.

In this case the order of the two parts is of no importance, because rowsToHide was computed beforehand from the full DataFrame. If the condition is instead evaluated on the intermediate result, the order can become important, because the filter might need a column that the column selection has already removed.
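A minimal sketch of the case where the order does matter, using a made-up `sample` frame (an assumption) and a condition that is evaluated on the already reduced frame:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame(
    {
        "Visibility": ["Show", "Hide", "Show", "Hide"],
        "Value": [30, 73, 44, 95],
        "Value2": [42, 37, 51, 39],
    }
)

# Filter first, then select columns: 'Visibility' is still available to the filter
rows_then_cols = sample[sample["Visibility"] == "Hide"][["Value", "Value2"]]

# Select columns first, then filter on the reduced frame: this fails,
# because 'Visibility' is no longer one of its columns
reduced = sample[["Value", "Value2"]]
try:
    reduced[reduced["Visibility"] == "Hide"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```
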

If we need only a single column, we can also use dot notation, which may be easier to read:
```python
testData[rowsToHide]['Value2']
testData[rowsToHide].Value2
testData['Value2'][rowsToHide]

testData.loc[rowsToHide, 'Value2']
```
All of these return the same result. The .loc variant is included to show that it differs from the others: it uses a single set of square brackets for its selection, with the rows first and the columns second.
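The equivalence can be checked on a small frame. This is only a sketch with made-up data (`sample` and `mask` are assumed names); it also shows that .loc accepts a list of columns in the second position:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame(
    {
        "Visibility": ["Show", "Hide", "Show", "Hide"],
        "Value": [30, 73, 44, 95],
        "Value2": [42, 37, 51, 39],
    }
)
mask = sample["Visibility"].str.contains("Hide")

chained = sample[mask]["Value2"]                 # two separate selections
with_loc = sample.loc[mask, "Value2"]            # rows and column in one step
several = sample.loc[mask, ["Value", "Value2"]]  # same, with more than one column
```
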

Above we created the variable rowsToHide, but if we only need the filter once, we can also write it inline:

In [95]:
testData['Value'][testData['Visibility'].str.contains("Hide")]

Date and time
2019-01-01 06:00:00    73
2019-01-04 12:00:00    95
2019-01-06 12:00:00    83
2019-01-08 18:00:00    91
2019-01-11 00:00:00     1
Name: Value, dtype: int64

We can also filter on multiple conditions, but each individual condition needs to be surrounded by round brackets '( )'. The & and | operators bind more tightly than comparisons such as ==, so without the brackets the conditions are not evaluated as intended.

In [12]:
testData[(testData['Visibility'].str.contains("Hide")) & (testData['Value3']=="B")]

Date and time,Visibility,Value,Value2,Value3
2019-01-01 06:00:00,Hide,73,37,B
2019-01-08 18:00:00,Hide,91,61,B
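To illustrate why the brackets matter, here is a small sketch on made-up data (the `sample` frame is an assumption). It combines conditions with & and |, and shows what happens when the brackets are left out:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame(
    {
        "Visibility": ["Show", "Hide", "Show", "Hide"],
        "Value": [30, 73, 44, 95],
        "Value2": [42, 37, 51, 39],
    }
)

# Each comparison in its own brackets: evaluated first, then combined
both = sample[(sample["Value"] > 50) & (sample["Value2"] < 38)]
either = sample[(sample["Visibility"] == "Hide") | (sample["Value"] < 40)]

# Without brackets, `50 & sample["Value2"]` is evaluated first, which leaves a
# chained comparison that cannot be reduced to a single True/False
try:
    sample[sample["Value"] > 50 & sample["Value2"] < 38]
    raised = False
except ValueError:
    raised = True
```
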


A useful shortcut for checking whether a column contains one of several possible values is the .isin() function, which avoids having to write multiple comparison statements.

In [14]:
testData[testData.Value3.isin(["A","B","C"])].head()

Date and time,Visibility,Value,Value2,Value3
2019-01-01 00:00:00,Show,30,42,A
2019-01-01 06:00:00,Hide,73,37,B
2019-01-01 12:00:00,Show,44,51,C
2019-01-02 12:00:00,Show,54,33,A
2019-01-02 18:00:00,Show,89,11,B
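As a sketch of what .isin() replaces, the example below compares it with the written-out comparisons and shows the inverted form. The `sample` frame and the `wanted` list are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame(
    {
        "Value": [30, 73, 44, 61, 26],
        "Value3": ["A", "B", "C", "D", "E"],
    }
)
wanted = ["A", "B", "C"]

in_list = sample[sample["Value3"].isin(wanted)]       # rows whose Value3 is in the list
not_in_list = sample[~sample["Value3"].isin(wanted)]  # inverted: all other rows

# The same selection written as separate comparisons
long_form = sample[
    (sample["Value3"] == "A") | (sample["Value3"] == "B") | (sample["Value3"] == "C")
]
```
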


### Removing rows

There are several ways to remove rows from a DataFrame, such as drop() or creating a filtered copy of the DataFrame.

To remove a row based on its position in the DataFrame, use the index property to look up the label of the row to delete:
```python
testData.drop(testData.index[3], inplace=True)
```
If the row has a numeric or string index, we can use its label to drop it:
```python
testData.drop('rows index', inplace=True)
```

A simple way to remove rows based on a condition is to keep only the rows that do not match it:
```python
testData = testData[testData.Value3 != 'A']
```

One thing to remember when dropping a row from a DataFrame that has a datetime index is that *you can't drop the rows by passing the date as a string*. It has to be a datetime object.
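The removal options above can be sketched on a small frame with a datetime index. The data is made up for illustration, and drop() is used without inplace so the original frame stays untouched:

```python
import pandas as pd

# Hypothetical stand-in data with a datetime index
sample = pd.DataFrame(
    {"Value": [30, 73, 44, 61]},
    index=pd.date_range("2019-01-01", periods=4, freq="6h", name="Date and time"),
)

# Drop by label, passing a datetime object rather than a plain string
by_label = sample.drop(pd.Timestamp("2019-01-01 06:00:00"))

# Drop by position, looking up the label through the index
by_position = sample.drop(sample.index[3])

# Drop by condition, keeping only the rows that do not match
by_condition = sample[sample["Value"] != 44]
```
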

### Row based operations

Normally if we run operations like min, max, mean, etc. on a DataFrame, we get the results per column. These can also be run on rows by passing the axis = 1 parameter.
```python
day_stats['min'] = data.min(axis = 1) 
```
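A short sketch of the difference between the default column-wise result and the row-wise result, on made-up numeric data (`numbers` is an assumed name):

```python
import pandas as pd

# Hypothetical numeric data
numbers = pd.DataFrame({"Value": [30, 73, 44], "Value2": [42, 37, 51]})

col_min = numbers.min()          # default (axis = 0): one result per column
row_min = numbers.min(axis=1)    # axis = 1: one result per row

# Store the row-wise result as a new column, as in the snippet above
with_min = numbers.copy()
with_min["min"] = row_min
```
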

### .loc for more complex selections or replacements

We can use the boolean Series we created before in combination with **.loc** to set values in the selected rows to a different value, without changing the others:

In [9]:
testData.loc[rowsToHide, 'Value'] = 0
testData.head(5)

Date and time,Visibility,Value,Value2,Value3
2019-01-01 00:00:00,Show,30,42,A
2019-01-01 06:00:00,Hide,0,37,B
2019-01-01 12:00:00,Show,44,51,C
2019-01-01 18:00:00,Show,61,8,D
2019-01-02 00:00:00,Show,26,1,E


The above is not reliable without .loc: with the DataFrame['name'][boolean] addressing method, the assignment may be applied to a temporary copy instead of the original DataFrame.
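A minimal sketch of conditional assignment with .loc, on made-up data. The names `sample`, `mask` and `updated` are assumptions, and the work is done on a copy so the original values remain available for comparison:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame(
    {
        "Visibility": ["Show", "Hide", "Show", "Hide"],
        "Value": [30, 73, 44, 95],
        "Value2": [42, 37, 51, 39],
    }
)
mask = sample["Visibility"] == "Hide"

updated = sample.copy()
updated.loc[mask, "Value"] = 0                                       # fixed value
updated.loc[mask, "Value2"] = updated.loc[mask, "Value2"] * 10       # computed value
```
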

### .query and alternatives

We can create a new DataFrame with only the rows that evaluate to true for a certain condition:

In [3]:
testData[(testData['Visibility'] == "Hide") & (testData['Value3'] == 'E')]

Date and time,Visibility,Value,Value2,Value3
2019-01-06 12:00:00,Hide,83,95,E
2019-01-11 00:00:00,Hide,1,74,E


This can be written in a shorter and more readable form with **.query()**:


In [4]:
testData.query('Visibility == "Hide" & Value3 == "E"')

Date and time,Visibility,Value,Value2,Value3
2019-01-06 12:00:00,Hide,83,95,E
2019-01-11 00:00:00,Hide,1,74,E
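Query strings can also refer to Python variables by prefixing them with @. The sketch below uses made-up data and assumed variable names (`wanted_letter`, `threshold`) and compares the query with the equivalent bracket notation:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame(
    {
        "Visibility": ["Show", "Hide", "Hide", "Hide"],
        "Value": [30, 83, 1, 73],
        "Value3": ["A", "E", "E", "B"],
    }
)
wanted_letter = "E"
threshold = 50

# @name looks up a Python variable instead of a column
result = sample.query(
    'Visibility == "Hide" and Value3 == @wanted_letter and Value > @threshold'
)

# The same selection in bracket notation
same = sample[
    (sample["Visibility"] == "Hide")
    & (sample["Value3"] == wanted_letter)
    & (sample["Value"] > threshold)
]
```
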


### .apply and lambdas

With **.apply()** we can run a function on every value in a column (a Series) or, with axis = 1, on every row of a DataFrame. We pass it either a named function or a lambda, which is then executed for each element.

In [9]:
test = lambda x: x + 100

testData['Value'] = testData['Value'].apply(test)
testData.head()

Date and time,Visibility,Value,Value2,Value3
2019-01-01 00:00:00,Show,330,42,A
2019-01-01 06:00:00,Hide,373,37,B
2019-01-01 12:00:00,Show,344,51,C
2019-01-01 18:00:00,Show,361,8,D
2019-01-02 00:00:00,Show,326,1,E
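As a sketch on made-up data, the example below applies a lambda to a single column, shows the equivalent plain arithmetic (usually simpler for basic operations like this), and applies a lambda per row with axis = 1:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame({"Value": [30, 73, 44], "Value2": [42, 37, 51]})

# Element-wise on one column: the lambda receives one value at a time
plus_100_apply = sample["Value"].apply(lambda x: x + 100)

# The same result with plain column arithmetic
plus_100_vector = sample["Value"] + 100

# Row-wise on the DataFrame: the lambda receives one row (a Series) at a time
row_total = sample.apply(lambda row: row["Value"] + row["Value2"], axis=1)
```
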


Next: [Column operations](07-Column_operations.ipynb) | [Content](00-Content.ipynb)