In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML
from matplotlib import pyplot as plt
pd.set_option('max_colwidth', 20)
plt.style.use('fivethirtyeight')
%matplotlib inline

# 4 - Accessing the data

## Outline

Goal: *Provide an overview of the available methods for indexing, selection and filtering of the data.*

Key topics:

- Access to the data
- Quick access methods
- Indexing attributes
- Filtering the data
- Managing axis labeling

## Accessing the data

The axis labeling information in pandas objects serves many purposes:

- identifies data (i.e. provides metadata)
- enables automatic and explicit data alignment
- allows intuitive getting and setting of subsets of the data set

In this lecture, we focus on the final point: how one can leverage the axis labeling for selecting the data.

Pandas provides the following approaches for accessing the data stored in a pandas data structure:

- quick access methods
- data iterators (not discussed here)
- indexing attributes and methods

The first two are primarily concerned with getting the data; the last allows both getting and setting of values in DataFrames and Series.

## Quick access methods

These methods allow us to take a quick peek at the data:

<table style="border-collapse:collapse;border-spacing:0"><tr><th style="font-family:Arial, sans-serif;font-size:18px;font-weight:bold;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">quick access method</th><th style="font-family:Arial, sans-serif;font-size:18px;font-weight:bold;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">description</th></tr><tr><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.head(n)`</td><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">returns first n rows</td></tr><tr><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.tail(n)`</td><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">returns last n rows</td></tr><tr><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.nlargest(n, columns,…)`</td><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">get the rows sorted by the n largest values of columns</td></tr><tr><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.nsmallest(n, columns,…)`</td><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 
5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">get the rows sorted by the n smallest values of columns</td></tr><tr><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.sample(n, frac, axis, …)`</td><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">return random sample of rows or columns</td></tr></table>

Show the first 2 rows:

In [2]:
films = pd.read_excel('data/dutch_films.xlsx')
films['random_number'] = np.random.random(len(films))
films.head(2)

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
0,Iep!,Ineke Houtman,Huub Stapel Joke...,Familiefilm,17 februari,0.012898
1,Gangsterboys,Paul Ruven,Georgina Verbaan...,Komedie,18 februari,0.833662


Show the 2 rows with the largest `random_number`:

In [3]:
films.nlargest(2, columns='random_number')

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
26,New Kids Turbo,Steffen Haars Fl...,Huub Smit Tim Ha...,Komedie,9 december,0.974559
13,Foeksia de Miniheks,Johan Nijenhuis,Rachelle Verdel ...,Fantasy,6 oktober,0.935438


Draw a sample of 5 items from `Genre` column without replacement:

In [36]:
se = films['Genre']
se.sample(5, replace=False)

r6       Drama
r18      Drama
r5       Drama
r9     Komedie
r21     Horror
Name: Genre, dtype: object
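The quick access methods can also be tried without the films file. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration only). It also shows that passing `random_state` to `sample` makes the draw reproducible:

```python
import pandas as pd

# Hypothetical toy data, not the films data set used above.
toy = pd.DataFrame(
    {"title": ["a", "b", "c", "d", "e"], "score": [0.2, 0.9, 0.5, 0.7, 0.1]}
)

first_two = toy.head(2)                          # rows at positions 0 and 1
last_two = toy.tail(2)                           # rows at positions 3 and 4
top_two = toy.nlargest(2, columns="score")       # ordered from largest score down
bottom_two = toy.nsmallest(2, columns="score")   # ordered from smallest score up

# Passing random_state makes the random draw reproducible between runs.
sample_a = toy.sample(3, random_state=0)
sample_b = toy.sample(3, random_state=0)
```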

## Indexing attributes

Indexing attributes provide access to the data stored in pandas objects through

- label based indexing
- position based indexing
- boolean indexing 

These can be used with several different types of indexers.


An overview of the indexing attributes:

<table style="border-collapse:collapse;border-spacing:0">
    <tr>
        <th style="font-family:Arial, sans-serif;font-size:18px;font-weight:bold;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal" rowspan="2">attribute</th>
        <th style="font-family:Arial, sans-serif;font-size:18px;font-weight:bold;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top" colspan="3">allowed indexers</th><th style="font-family:Arial, sans-serif;font-size:18px;font-weight:bold;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal" rowspan="2">description</th>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;font-weight:bold;text-align:center;vertical-align:top">value</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;font-weight:bold;text-align:center;vertical-align:top">list</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;font-weight:bold;text-align:center;vertical-align:top">slice</td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal">`s.<idx>`<br>`df.<col>`</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center">label</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal">access a column (or index) as an attribute</td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`s[…]`</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, int, bool</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">basic indexer for selecting values</td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal">`df[…]` (cols)<br></td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center">label</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center">label, int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal">basic indexer for selecting columns</td></tr><tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`df[…]` (rows)</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">bool</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">basic indexer for selecting rows</td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.loc[…]`</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, bool</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">label-location based indexer</td>    
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.iloc[…]`</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">int, bool</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">integer-location based indexer</td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.ix[…]`</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, int, bool</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label, int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">label-location based indexer, with integer position fallback <p>(deprecated in pandas 0.20 and removed in pandas 1.0; use `.loc` or `.iloc` instead)</p></td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.at[…]`</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">label</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">fast label-location based scalar accessor</td>
    </tr>
    <tr>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">`obj.iat[…]`</td><td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">int</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;text-align:center;vertical-align:top">-</td>
        <td style="font-family:Arial, sans-serif;font-size:18px;padding:5px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;vertical-align:top">fast integer-location based scalar accessor</td>
    </tr>
</table>
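The difference between label based and position based indexing is easiest to see when the labels are integers that do not match the positions. A minimal sketch, using a made-up Series `s` (an assumption for illustration, not part of the films data):

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[10]       # label 10 is the first element
by_position = s.iloc[0]    # position 0 is the first element as well

# Label slices include the end point, position slices exclude it.
label_slice = s.loc[10:20]     # labels 10 and 20
position_slice = s.iloc[0:1]   # position 0 only

# A boolean list selects the elements where the mask is True.
masked = s.loc[[True, False, True]]
```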

Next we will go over some examples of different data selections.

### Selecting column(s)

Select single column as Series:

In [5]:
s = films['Titel']
s = films.loc[:, 'Titel']
s = films.iloc[:, 0]
s = films.Titel
s.head(3)

0                   Iep!
1           Gangsterboys
2    Snuf en de IJsvogel
Name: Titel, dtype: object

Select single column as DataFrame:

In [6]:
df = films[['Titel']]
df = films.loc[:, ['Titel']]
df = films.iloc[:, [0]] 
df.head(3)

Unnamed: 0,Titel
0,Iep!
1,Gangsterboys
2,Snuf en de IJsvogel


Select multiple columns:

In [9]:
df = films[['Titel', 'Regisseur', 'Acteurs']]
df = films.loc[:, ['Titel', 'Regisseur', 'Acteurs']]
df = films.loc[:, 'Titel':'Acteurs']  # incl. 'Regisseur'
df = films.iloc[:, [0, 1, 2]]
df = films.iloc[:, :3]  # excl. 3

bool_idx = films.columns.isin(['Titel', 'Regisseur', 'Acteurs'])
df = films.loc[:, bool_idx]  # mask was built from films.columns, so index films

df.head(3)

Unnamed: 0,Titel,Regisseur,Acteurs
0,Iep!,Ineke Houtman,Huub Stapel Joke...
1,Gangsterboys,Paul Ruven,Georgina Verbaan...
2,Snuf en de IJsvogel,Steven de Jong,Ydwer Bosma Joos...


### Selecting row(s)

First, let's add row labels to the index of the DataFrame:

In [10]:
films.index = ['r{}'.format(i) for i in range(len(films))]
films.head(5)

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
r0,Iep!,Ineke Houtman,Huub Stapel Joke...,Familiefilm,17 februari,0.012898
r1,Gangsterboys,Paul Ruven,Georgina Verbaan...,Komedie,18 februari,0.833662
r2,Snuf en de IJsvogel,Steven de Jong,Ydwer Bosma Joos...,Familiefilm,20 februari,0.207152
r3,Zwart water,Elbert van Strien,Hadewych Minis B...,Horror,11 maart,0.531159
r4,First Mission,Boris Pavel Conen,Anniek Pheifer T...,Drama,25 maart,0.22096


Selecting single row as Series:

In [11]:
s = films.loc['r4']
s = films.iloc[4]
s

Titel                   First Mission
Regisseur           Boris Pavel Conen
Acteurs           Anniek Pheifer T...
Genre                           Drama
Bijzonderheden               25 maart
random_number                 0.22096
Name: r4, dtype: object

Selecting single row as DataFrame:

In [12]:
df = films.loc[['r4']]
df = films.iloc[[4]]
df = films[films.index == 'r4']
df

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
r4,First Mission,Boris Pavel Conen,Anniek Pheifer T...,Drama,25 maart,0.22096


Selecting multiple rows:


In [13]:
df = films['r2':'r4'] # incl 'r4'
df = films[2:5] # excl 5
df = films.loc['r2':'r4'] 
df = films.loc[slice('r2', 'r4')]
df = films.iloc[2:5]
df = films.loc[films.index.isin(['r2', 'r3', 'r4'])]
df

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
r2,Snuf en de IJsvogel,Steven de Jong,Ydwer Bosma Joos...,Familiefilm,20 februari,0.207152
r3,Zwart water,Elbert van Strien,Hadewych Minis B...,Horror,11 maart,0.531159
r4,First Mission,Boris Pavel Conen,Anniek Pheifer T...,Drama,25 maart,0.22096


### Selecting a cross-section

In [14]:
df = films.loc['r2':'r4', 
               ['Titel', 'Regisseur', 'Acteurs']]
df = films.loc[slice('r2','r4'),
               ['Titel', 'Regisseur', 'Acteurs']]
df = films.iloc[2:5, 0:3]
df.head(3)

Unnamed: 0,Titel,Regisseur,Acteurs
r2,Snuf en de IJsvogel,Steven de Jong,Ydwer Bosma Joos...
r3,Zwart water,Elbert van Strien,Hadewych Minis B...
r4,First Mission,Boris Pavel Conen,Anniek Pheifer T...


### Selecting a single value

In [15]:
val = films.loc['r2', 'Titel']
val = films.iloc[2, 0]
val = films.iat[2, 0]
val = films.at['r2', 'Titel']
val

'Snuf en de IJsvogel'
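The same indexing attributes can be used on the left-hand side of an assignment to set values. A minimal sketch on a made-up frame (the name `toy` and its contents are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with labelled rows.
toy = pd.DataFrame(
    {"title": ["a", "b", "c"], "score": [1.0, 2.0, 3.0]},
    index=["r0", "r1", "r2"],
)

toy.at["r1", "score"] = 20.0        # set a single value by label
toy.iat[2, 1] = 30.0                # set a single value by position
toy.loc["r0", "title"] = "first"    # .loc can set scalars as well

# .loc can also assign to several cells at once, which .at cannot.
toy.loc[["r1", "r2"], "title"] = ["second", "third"]
```

`.at` and `.iat` only accept a single row and column, which is why they are limited to scalar access.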

### Pitfall: chained indexing

Try to avoid chaining indexing operations, especially when modifying the resulting object.

Example of chained indexing and preferred alternative:

In [16]:
val = df.iloc[2]['Titel']  # chained indexing
val = df.loc[df.index[2], 'Titel']  # preferred! 
val

'First Mission'

Why should we avoid this? 
- with chaining it is often unclear if the result is a copy or view 
- this can cause problems when one wants to modify the original object

An example of assignment failing due to chained indexing (notice the warning):

In [17]:
films.loc['r2']['Titel'] = 'Some new title'  # modifying a copy!
films.loc['r2']['Titel']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


'Snuf en de IJsvogel'

Now doing the assignment correctly:

In [18]:
films.loc['r2', 'Titel'] = 'Another new title'  # modifying the original object!
films.loc['r2', 'Titel']

'Another new title'

The `SettingWithCopyWarning` can be annoying; however, it can save you from this pitfall!
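The pitfall can be reproduced in isolation. The sketch below uses a made-up frame `toy` with mixed column types (an assumption for illustration), for which selecting a row builds a new Series. The warning is silenced only to keep the demo output clean; the exact warning type depends on the pandas version:

```python
import warnings
import pandas as pd

# Hypothetical toy frame with mixed dtypes, so selecting a row builds a new Series.
toy = pd.DataFrame(
    {"title": ["a", "b"], "score": [1.0, 2.0]}, index=["r0", "r1"]
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # silence the warning for this demo only
    toy.loc["r0"]["title"] = "changed"    # chained: writes to a temporary copy

after_chained = toy.loc["r0", "title"]    # the original still holds the old value

toy.loc["r0", "title"] = "changed"        # single .loc call: writes to toy itself
after_loc = toy.loc["r0", "title"]
```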

### Which indexing attribute to use?

Use the following ordering as a rule of thumb:
 
 1. `obj.loc[]`
 2. `obj.iloc[]`
 3. `[]`

## Filtering the data

Filtering allows the selection of subsets of data that satisfy some specified criterion.



The most common approaches to filtering are:

- filtering with comparison operators
- filtering with boolean methods 
    * e.g. `obj.isin()`, `obj.isnull()`

Next we will go over these methods in more detail.

###  Filtering with comparison operators

Comparison operators allow easy filtering through boolean indexing:

- the available operators: `<`, `<=`, `==`, `!=`, `>=`, `>`
- comparison is usually done on a Series object (= a single DataFrame column)
- combining multiple comparisons is allowed:
  - combine using boolean operators: `|` (=or), `&` (=and)
  - each comparison must be enclosed in parentheses, e.g. `df[(df['a'] > 1) & (df['b'] < 5)]`
- negations can be done with the `~` (=not) boolean operator
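The parentheses are needed because `&` and `|` bind more tightly than the comparison operators in Python, so an unparenthesized expression is grouped differently and typically fails. A minimal sketch of the three boolean operators on a made-up frame (the columns `a` and `b` and their values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with columns 'a' and 'b'.
df = pd.DataFrame({"a": [0, 2, 3, 4], "b": [9, 4, 7, 1]})

both = df[(df["a"] > 1) & (df["b"] < 5)]         # and: both conditions hold
either = df[(df["a"] > 3) | (df["b"] > 8)]       # or: at least one holds
neither = df[~((df["a"] > 3) | (df["b"] > 8))]   # not: negate the combined mask
```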

Select films with `random_number` larger than 0.9:

In [19]:
df = films[films['random_number'] > .9]
df

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
r13,Foeksia de Miniheks,Johan Nijenhuis,Rachelle Verdel ...,Fantasy,6 oktober,0.935438
r24,Het Geheim,Joram Lürsen,Theo Maassen Cha...,Familiefilm,1 december,0.912344
r26,New Kids Turbo,Steffen Haars Fl...,Huub Smit Tim Ha...,Komedie,9 december,0.974559


Select films with `random_number` larger than 0.9 whose `Genre` is `'Drama'`:

In [20]:
df = films[(films['random_number'] > .9) & 
           (films['Genre'] == 'Drama')]
df

Unnamed: 0,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number


### Filtering with boolean methods

The following instance methods return boolean arrays, so they can also be used for boolean indexing:

- `obj.isin()`
- `obj.duplicated()`
- `obj.isnull()`
- `obj.notnull()`
- `s.str.<method>`


First, let's create a new dataset out of our films data:

In [21]:
from ast import literal_eval
titles = pd.DataFrame({
    'full_title': films['Titel'],
    'score': films['random_number'],
    'genre': films['Genre'].str.split().str[0].str.lower(),
    'first_word': films['Titel'].str.split().str[0].str.lower()
})
titles.head(3)

Unnamed: 0,full_title,score,genre,first_word
r0,Iep!,0.012898,familiefilm,iep!
r1,Gangsterboys,0.833662,komedie,gangsterboys
r2,Another new title,0.207152,familiefilm,another


Let's see which columns have null values:

In [22]:
titles.isnull().sum()

full_title    0
score         0
genre         0
first_word    0
dtype: int64

Select rows without null values:

In [23]:
df = titles[titles['first_word'].notnull() &
            titles['full_title'].notnull()]
df.shape

(28, 4)

Alternative method using boolean reductions:

In [24]:
df = titles[titles.notnull().all(axis=1)]
df.shape

(28, 4)
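Since `titles` has no missing values, both selections above return every row. A minimal sketch on a made-up frame that does contain missing values (the name `toy` and its contents are assumptions for illustration) shows the effect, along with the built-in `dropna` shortcut:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that, unlike `titles`, contains missing values.
toy = pd.DataFrame(
    {"full_title": ["a", None, "c"], "score": [0.1, 0.2, np.nan]}
)

null_counts = toy.isnull().sum()             # missing values per column
complete = toy[toy.notnull().all(axis=1)]    # rows without any missing value
same_as_dropna = toy.dropna()                # built-in shortcut for the line above
```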

Select rows for which `genre` is fantasy or horror:

In [25]:
df = titles[titles['genre'].isin(['fantasy', 'horror'])]
df

Unnamed: 0,full_title,score,genre,first_word
r3,Zwart water,0.531159,horror,zwart
r13,Foeksia de Miniheks,0.935438,fantasy,foeksia
r21,Sint,0.588477,horror,sint


De-duplicate based on the `genre`:

In [26]:
df = titles[~titles.duplicated(subset=['genre'])]
df

Unnamed: 0,full_title,score,genre,first_word
r0,Iep!,0.012898,familiefilm,iep!
r1,Gangsterboys,0.833662,komedie,gangsterboys
r3,Zwart water,0.531159,horror,zwart
r4,First Mission,0.22096,drama,first
r12,De Leugen,0.55148,documentairefilm,de
r13,Foeksia de Miniheks,0.935438,fantasy,foeksia
r14,LelleBelle,0.504052,romantiek,lellebelle


Select rows for which the first word starts with a `'g'`:

In [27]:
df = titles[titles['first_word'].str.startswith('g')] 
df

Unnamed: 0,full_title,score,genre,first_word
r1,Gangsterboys,0.833662,komedie,gangsterboys


### Filtering with functions

Indexers `.loc` and `.iloc` also accept a function:

In [28]:
def find_horror(df):
    return df['genre'] == 'horror'

titles.loc[find_horror].head()

Unnamed: 0,full_title,score,genre,first_word
r3,Zwart water,0.531159,horror,zwart
r21,Sint,0.588477,horror,sint


Instead of a named function, you can also use an anonymous `lambda` function that works on the whole DataFrame:

In [29]:
titles.loc[lambda x: x['genre'] == 'horror']

Unnamed: 0,full_title,score,genre,first_word
r3,Zwart water,0.531159,horror,zwart
r21,Sint,0.588477,horror,sint
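When an indexer receives a callable, pandas calls it with the object being indexed and uses the return value as the indexer. A minimal sketch on a made-up frame (the names `toy` and `is_horror` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({"genre": ["horror", "drama", "horror"], "score": [1, 2, 3]})

def is_horror(frame):
    # Receives the object being indexed and returns a boolean Series.
    return frame["genre"] == "horror"

via_callable = toy.loc[is_horror]     # pandas calls is_horror(toy)
via_mask = toy.loc[is_horror(toy)]    # explicit equivalent

# A callable can be combined with a column indexer, and .iloc accepts one too.
scores = toy.loc[is_horror, "score"]
first_rows = toy.iloc[lambda frame: [0, 1]]
```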


This syntax allows _method chaining_:

In [30]:
df = (
    titles
    .groupby(['genre'])['first_word'].value_counts()
    .reset_index(name='counts')
    .loc[lambda x: (x['counts'] > 1)]
)
df

Unnamed: 0,genre,first_word,counts
1,drama,de,3


Another chaining example:

In [31]:
df = (
    titles
    .assign(num_words=lambda x: x['full_title'].str.split().str.len())
    .loc[lambda x: x['num_words'] > 3]
    .sort_values(by='num_words', ascending=False)
)      
df

Unnamed: 0,full_title,score,genre,first_word,num_words
r5,Kom niet aan mij...,0.843056,drama,kom,5
r15,Sinterklaas en h...,0.071893,familiefilm,sinterklaas,5
r6,De vliegenierste...,0.083725,drama,de,4
r20,Snuf en het spoo...,0.129858,familiefilm,snuf,4


## Managing axis labeling

We have seen that the axis labeling provides an intuitive infrastructure for indexing and filtering of the data. 

### Set and reset the index

To set and reset the index of an existing DataFrame use:
```python
df.set_index(keys, drop=True, append=False, 
             inplace=False, ...)
```
```python
df.reset_index(level=None, drop=False, 
               inplace=False, ...)
```

Let's start by resetting our old index:

In [32]:
df = films.reset_index()
df.head(2)

Unnamed: 0,index,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
0,r0,Iep!,Ineke Houtman,Huub Stapel Joke...,Familiefilm,17 februari,0.012898
1,r1,Gangsterboys,Paul Ruven,Georgina Verbaan...,Komedie,18 februari,0.833662


Set a new index in-place (without creating a new object):

In [33]:
df.set_index('Titel', inplace=True)
df.head(2)

Titel,index,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
Iep!,r0,Ineke Houtman,Huub Stapel Joke...,Familiefilm,17 februari,0.012898
Gangsterboys,r1,Paul Ruven,Georgina Verbaan...,Komedie,18 februari,0.833662


Appending to an existing index:

In [34]:
appended_df = df.set_index('index', append=True)
appended_df.head(2)

Titel,index,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
Iep!,r0,Ineke Houtman,Huub Stapel Joke...,Familiefilm,17 februari,0.012898
Gangsterboys,r1,Paul Ruven,Georgina Verbaan...,Komedie,18 februari,0.833662


Reset a specific index level:

In [35]:
reset_df = appended_df.reset_index('Titel')
reset_df.head(2)

index,Titel,Regisseur,Acteurs,Genre,Bijzonderheden,random_number
r0,Iep!,Ineke Houtman,Huub Stapel Joke...,Familiefilm,17 februari,0.012898
r1,Gangsterboys,Paul Ruven,Georgina Verbaan...,Komedie,18 februari,0.833662


## Exercises: [lab 4 - Accessing the data](lab_04_accessing_the_data.ipynb)