In [None]:
from slide_tools import hide_code_in_slideshow

# Selecting and Filtering

## Applied Review

### Data Structures and the DataFrame Class

* Data is frequently represented inside a **DataFrame** - a class from the `pandas` library that is similar to a *table*

* Each DataFrame object has rows and columns

* The DataFrame class has methods (built-in operations) for common tasks, and attributes (stored data) of common information

### Importing Data

* Tabular data can be imported into DataFrames using the `pd.read_csv()` function - there are parameters for different options

In [None]:
import pandas as pd
planes_df = pd.read_csv('data/planes.csv')
planes_df.head()

* Common parameters are `sep`, `header` and `parse_dates`
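
As a minimal sketch of these parameters, the snippet below reads a small semicolon-separated table from an in-memory string instead of a file. The column names and values are made up for illustration and are not taken from `planes.csv`.

```python
import io
import pandas as pd

# Hypothetical raw data: semicolon-separated, with a header row and a date column
raw = io.StringIO(
    "tailnum;first_flight;seats\n"
    "N1;2004-03-01;55\n"
    "N2;1998-11-15;182\n"
)

demo_df = pd.read_csv(
    raw,
    sep=';',                       # column delimiter (the default is ',')
    header=0,                      # row number that holds the column names
    parse_dates=['first_flight'],  # convert this column to datetime values
)
print(demo_df.dtypes)
```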

## Selecting Data

### Subsetting Dimensions

* We don't always want all of the data in a DataFrame, so we need to take subsets of the DataFrame.

* In general, **subsetting** is extracting a small portion of a DataFrame -- making the DataFrame smaller.

* Since the DataFrame is two-dimensional, there are two dimensions on which to subset.

**Dimension 1:** We may only want to consider certain *variables*.

For example, in this airline dataset we may only care about the `year` and `engines` variables:

In [None]:
hide_code_in_slideshow()

planes_df.head().style.apply(lambda x: ['background: lightblue' if x.name == 'engines' or x.name == 'year' else '' for i,_ in x.items()]).format('{:.1f}', subset=['year'])

We call this **selecting** columns/variables -- this is similar to SQL's `SELECT` statement

**Dimension 2:** We may only want to consider certain *cases*.

For example, we may only care about the cases where the manufacturer is Embraer.

In [None]:
hide_code_in_slideshow()
planes_df.head().style.apply(lambda x: ['background: lightgreen' if i in [0, 4] else '' for i,_ in x.items()]).format('{:.1f}', subset=['year'])

We call this **filtering** or **slicing** -- this is similar to SQL's `WHERE` statement

And we can combine these two options to subset in both dimensions -- the `year` and `engines` variables where the manufacturer is Embraer:

In [None]:
hide_code_in_slideshow()

planes_df.head().style.apply(lambda x: ['background: lightgreen' if i in [0, 4] else '' for i,_ in x.items()]).apply(lambda x: ['background: lightblue' if x.name == 'engines' or x.name == 'year' else '' for i,_ in x.items()]).apply(lambda x: ['background: teal' if i in [0, 4] and (x.name == 'engines' or x.name == 'year') else '' for i,_ in x.items()]).format('{:.1f}', subset=['year'])

In [None]:
hide_code_in_slideshow()
planes_df.head().style.apply(lambda x: ['background: teal' if i in [0, 4] and (x.name == 'engines' or x.name == 'year') else '' for i,_ in x.items()]).format('{:.1f}', subset=['year'])

### Subsetting into a New DataFrame

In this example we want to do two things on our airplane sample dataset:

  1. **select** the `year` and `engines` variables
  2. **filter** to cases where the manufacturer is Embraer

But we don't just want to highlight the desired cells. We want to return them as a new DataFrame we can continue to work with.

In other words, we want to turn this:

In [None]:
planes_df.head()

Into this:

In [None]:
hide_code_in_slideshow()
sample_df = planes_df.head() 
cols = ['year', 'engines']
filt = sample_df['manufacturer'] == 'EMBRAER'
sample_df[cols][filt]

So we really have a third need: return the resulting DataFrame so we can continue our analysis:

  1. **select** the `year` and `engines` variables
  2. **filter** to cases where the manufacturer is Embraer
  3. Return a DataFrame to continue the analysis
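
The three steps can be sketched on a tiny hand-made table. The rows below are invented stand-ins for the airplane sample, and `.loc` is used here as one possible way to combine both steps; it is covered in detail later in this section.

```python
import pandas as pd

# Made-up stand-in for the airplane sample (not the real planes.csv rows)
toy_planes = pd.DataFrame({
    'manufacturer': ['EMBRAER', 'AIRBUS', 'BOEING', 'EMBRAER'],
    'year': [2004, 1998, 1999, 2002],
    'engines': [2, 2, 2, 2],
})

cols = ['year', 'engines']                       # 1. variables to select
filt = toy_planes['manufacturer'] == 'EMBRAER'   # 2. cases to keep
result = toy_planes.loc[filt, cols]              # 3. a new DataFrame

print(type(result))
print(result)
```

Because `result` is itself a DataFrame, methods such as `head()` can be applied to it just like to the original.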

## Selecting Variables

We can select a single variable using bracket subsetting notation.  
Let's use the familiar movies dataset again to illustrate this:

In [None]:
movies_df = pd.read_csv('data/movies.csv')

In [None]:
movies_df['director_name'].head()

Notice the `head()` method also works on `movies_df['director_name']` to return the first five elements.

<font class="question">
    <strong>Question</strong>:<br><em>What is the data type of <code>movies_df['director_name']</code>?</em>
</font>

The operation `df['col_name']` returns an object of type `pandas.core.series.Series`. That is, it returns a "Series", rather than a DataFrame.

In [None]:
type(movies_df['director_name'])

That is okay -- the Series is like the younger sibling of a DataFrame:

* A Series is a **one-dimensional** data structure -- similar to a Python `list`

* Note that all values in a Series share the **same data type** (the Series' `dtype`)

* Each DataFrame can be thought of as a collection of equal-length Series (plus Index)

This visual representation of a Series and DataFrame may be helpful:

<center>
<img src="images/dataframe-series.png" alt="dataframe-series.png" width="600" height="600">
</center>
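
A small sketch of these points, using an invented two-column table: a column pulled out of a DataFrame is a one-dimensional Series with its own dtype, and it shares the DataFrame's Index.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C'],      # hypothetical values
    'duration': [120, 95, 141],
})

durations = toy_df['duration']     # a single column is a Series

print(type(durations))             # a pandas Series, not a DataFrame
print(durations.ndim)              # number of dimensions of the Series
print(durations.dtype)             # one dtype for the whole Series
print(durations.index.equals(toy_df.index))  # same Index as the DataFrame
```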

We can select a single variable and return the values as a Series by passing the variable name as a `string`:

In [None]:
movies_df['director_name'].head(3)

In [None]:
type(movies_df['director_name'])

We can return one or more variables as a DataFrame by passing a `list` of variable names:

In [None]:
movies_df[['director_name']].head(3)

In [None]:
type(movies_df[['director_name']].head())

<font class="question">
    <strong>Question</strong>:<br><em>What's another advantage of passing a <code>list</code> rather than a <code>string</code>?</em>
</font>

Passing a list into the bracket subsetting notation allows us to select multiple variables at once:

In [None]:
movies_df[['title','director_name']].head()

As another example, let's first set the index again. Then assume we are only interested in the `year` each movie was produced, its `duration` and the `budget` used for producing it:

In [None]:
movies_df = movies_df.set_index('title')
movies_df[['year', 'duration', 'budget']].head()

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>


1. What type of object is a DataFrame column?
2. Select 3 variables from the movies dataset and store them in a new variable called `columns_of_interest`.
3. What is wrong with the following code?

   ```python
   planes_df['type', 'model']
   ```

## Subsetting Cases

When we subset cases/records/rows two names are often used for this activity: **slicing** and **filtering**

But *these two are not the same*:

  * **slicing**, similar to row **indexing**, subsets cases by the _value of the Index_
  * **filtering** subsets cases using a _conditional test_

In [None]:
import pandas as pd
movies_df = pd.read_csv('data/movies.csv')

### Slicing Cases

Remember that all DataFrames have an Index:

In [None]:
movies_df.head()

We can **slice** cases/rows using the values in the Index and bracket subsetting notation.  
It's recommended to use the `.loc` attribute of a DataFrame to slice cases/rows:

In [None]:
movies_df.loc[0:3]

Note that the last element here is _inclusive_, unlike regular Python slicing.
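
To see the inclusive end point concretely, here is a small sketch on an invented DataFrame with the default integer Index. The comparison with `.iloc` (position-based, end point excluded) and with a plain Python list is added for contrast and is not part of the example above.

```python
import pandas as pd

toy_df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})  # default Index 0..4

by_label = toy_df.loc[0:3]      # label-based: rows 0, 1, 2 and 3
by_position = toy_df.iloc[0:3]  # position-based: rows 0, 1 and 2
plain_list = [10, 20, 30, 40, 50][0:3]  # Python lists also exclude the end

print(len(by_label), len(by_position), len(plain_list))
```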

We can also pass a `list` of Index values:

In [None]:
movies_df.loc[[2, 4, 6, 8]]

Initially the `.loc` attribute can seem strange. `loc` stands for "location". 

Whereas `[]` or _just the brackets_ notation is primarily used to select columns, the `.loc[]` syntax is used to access _rows_ of a DataFrame.  

That said, `.loc` is more flexible and allows us to select _both_ rows and columns like so: 
```python
df.loc['row_name','col_name']
```

Or multiple rows/columns by passing in a list of row/column names:
```python
df.loc[['row_1','row_2'],['col_1','col_2']]
```

So everything that can be done with `[]` can also be done with `.loc[]`, but not the other way around: _just the brackets_ cannot select rows by a single label or a list of labels (for rows they only accept a slice such as `df[0:3]` or a Boolean filter). So in practice `[]` is mostly a convenient shortcut for selecting columns, which is what we do most of the time.  
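
The `df.loc[...]` patterns above can be tried on a small invented table. The titles, years and durations below are hypothetical; the row labels come from `set_index`.

```python
import pandas as pd

# Invented data; 'title' becomes the Index and provides the row labels
toy_df = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma'],
    'year': [1999, 2004, 2010],
    'duration': [120, 95, 141],
}).set_index('title')

single_value = toy_df.loc['Beta', 'year']                       # one row, one column
sub_df = toy_df.loc[['Alpha', 'Gamma'], ['year', 'duration']]   # lists of labels
one_column = toy_df.loc[:, 'year']   # ':' keeps all rows; same result as toy_df['year']

print(single_value)
print(sub_df)
```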

Both notations support filtering of records, which we will look at next.

### Filtering Cases

We can **filter** cases/rows _using a logical sequence equal in length_ to the number of rows in the DataFrame.

Continuing our example, assume we want to determine for each movie whether it was produced in America, i.e. whether the `country` is USA. We can use the `country` Series together with an equality test (`==`) to find the result for each row:

In [None]:
movies_df['country'].head() == 'USA'

We can use this resulting logical sequence to **filter** cases -- rows that are `True` will be returned while those that are `False` will be removed:

In [None]:
filt = movies_df['country'] == 'USA'
movies_df[filt].head()

This is the same as doing:

In [None]:
movies_df.head()[[True,True,False,True,True]]

This technique also works using the `.loc` attribute:

In [None]:
filt = movies_df['country'] == 'USA'
movies_df.loc[filt].head()

Any conditional test can be used to **filter** DataFrame rows:

In [None]:
filt = movies_df['year'] < 2002
movies_df.loc[filt].head()
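
As a sketch of other tests that yield a Boolean Series, the snippet below uses an invented table. `Series.isin` is a pandas method that checks membership in a list of values; the data itself is hypothetical.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'country': ['USA', 'UK', 'France', 'USA'],
    'year': [1999, 2004, 2010, 2001],
})

not_usa = toy_movies['country'] != 'USA'                # inequality test
europe = toy_movies['country'].isin(['UK', 'France'])   # membership test
recent = toy_movies['year'] >= 2004                     # numeric comparison

print(toy_movies.loc[not_usa])
print(toy_movies.loc[recent, 'title'].tolist())
```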

And **multiple conditional tests** can be combined using logical operators:

In [None]:
filt = (movies_df['year'] < 2002) & (movies_df['year'] > 1998)
movies_df.loc[filt].head(4)

Note that each condition is wrapped in parentheses -- this is required when the tests are combined on one line, because `&` and `|` bind more tightly than comparison operators such as `<` and `>`.
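
The effect of the parentheses can be checked on a small invented Series. The sketch also shows `|` ("or") and `~` ("not"), which combine Boolean Series in the same way as `&`.

```python
import pandas as pd

years = pd.Series([1997, 1999, 2001, 2003], name='year')  # invented values

# With parentheses: each comparison is evaluated first, then combined
both = (years < 2002) & (years > 1998)
either = (years < 1998) | (years > 2002)   # | means "or"
neither = ~either                          # ~ negates a Boolean Series

# Without parentheses, & is evaluated before the comparisons, so Python
# reads this as: years < (2002 & years) > 1998, which raises a ValueError
try:
    years < 2002 & years > 1998
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)

print(both.tolist())
```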

Another approach that doesn't require the parentheses and helps with readability is to store each test's result in a separate filter variable and then combine them in a final filter:

In [None]:
filt1 = movies_df['year'] < 2002
filt2 = movies_df['year'] > 1998
filt  = filt1 & filt2
movies_df.loc[filt].head(4)

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

1. What is the difference between **slicing** and **filtering**?
2. Fill in the blanks in the following code template to identify all movies that have an IMDB score higher than 9.
```python
filt = movies_df['___'] > 9
movies_df.loc[filt]
```
3. Run the code. How many movies with an IMDB rating above 9 are there? Is the result surprising?
4. Change the filter so that only movies with a score > 8.8 and at least 100 reviews remain in the result set. How many movies in our dataset meet this requirement?

#<span style='color: white'>
filt1 = movies_df['imdb_score'] > 8.8
filt2 = movies_df['num_reviews'] >= 100  # "at least 100 reviews" includes exactly 100
filt = filt1 & filt2
movies_df.loc[filt].shape
#</span>

### Selecting Variables and Filtering Cases

If we want to select variables and filter cases at the same time, we have two options:

1. Sequential operations
2. Simultaneous operations

#### Sequential Operations

We can use what we've previously learned to select variables and filter cases in multiple steps:

In [None]:
filt = movies_df['imdb_score'] > 8.5
movies_df_filtered = movies_df.loc[filt]
cols = ['title', 'director_name', 'imdb_score']
movies_df_filtered_and_selected = movies_df_filtered[cols]
movies_df_filtered_and_selected.head()

This is a good way to learn how to select and filter one step at a time.

#### Sequential Operations + Chaining

We can also chain these operations one after another like so:

In [None]:
filt = movies_df['imdb_score'] > 8.5
cols = ['title', 'director_name', 'imdb_score']
movies_df.loc[filt][cols].head()

#### Simultaneous Operations

Finally, we can also do both selecting and filtering in a single step with `.loc`:

In [None]:
filt = movies_df['imdb_score'] > 8.5
cols = ['title', 'director_name', 'imdb_score']
movies_df.loc[filt,cols].head()

The general syntax for `.loc` is `df.loc[rows,cols]` where `rows` selects the desired rows and can either be a single row label, a list of row labels, or a filter like we just saw. Similarly `cols` can be either a single column label or a list of column labels.
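
The different forms of `rows` and `cols` can be sketched on a small invented table. Note how the return type changes with the kind of selector; the column names mirror the example above, but the values are hypothetical.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma'],
    'director_name': ['Ann', 'Bob', 'Cy'],
    'imdb_score': [8.9, 7.1, 8.6],
})

filt = toy_movies['imdb_score'] > 8.5

a = toy_movies.loc[0, 'title']                      # single label, single column -> scalar
b = toy_movies.loc[[0, 2], ['title']]               # lists of labels -> DataFrame
c = toy_movies.loc[filt, 'imdb_score']              # filter, single column -> Series
d = toy_movies.loc[filt, ['title', 'imdb_score']]   # filter, list of columns -> DataFrame

for obj in (a, b, c, d):
    print(type(obj))
```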

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn
</font>

1. Filter the movies dataset so that only movies that were produced in 1998 or between 2003 and 2005, and that have a running time of at least two hours, remain. In addition, keep only the movie title, the director, the production year, the running time, and the production budget as variables.
2. Load another example dataset, e.g. `bikes.csv`, `airbnb.csv` or `employee.csv`, from the `../data/` folder and try out a few selections and filters.

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>