### Intermezzo - How the `mode()` function works and how to extract useful information from it

**Aim:** Determine the mode of each categorical feature in a data frame and infer information from the results. The mode is the most frequently occurring value of a feature.

We start by importing the Pandas package,

In [2]:
import pandas as pd

and create a data frame with some example data.

In [3]:
df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "cat", "dog", "cat", "horse"],
    "id": ["a7", "a5", "a3", "a4", "a1", "a6", "a2"],
    "color": ["red", "red", "red", "red", "red", "red", "blue"]
})

Applying the `mode()` function, we get a data frame with the most frequent values of each column. In the `animal` column, cat and dog occur equally often, i.e., three times each. Both are listed in the data frame in alphabetical order: cat comes first even though dog appears first in `df.animal`. In `df.id` all values are unique, so all of them show up, alphabetically ordered, in the data frame below. Columns with fewer modes than the longest one are padded with missing values. A clear-cut case, with just one winner, is `df.color`, where red occurs most often.

In [4]:
df_mode = df.mode()

df_mode


  animal  id color
0    cat  a1   red
1    dog  a2   NaN
2    NaN  a3   NaN
3    NaN  a4   NaN
4    NaN  a5   NaN
5    NaN  a6   NaN
6    NaN  a7   NaN
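
To see where these values come from, we can rebuild the mode of a single column by hand. This is only a minimal sketch to illustrate the idea, not how Pandas implements `mode()` internally. The helper name `manual_mode` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "cat", "dog", "cat", "horse"],
    "id": ["a7", "a5", "a3", "a4", "a1", "a6", "a2"],
    "color": ["red", "red", "red", "red", "red", "red", "blue"],
})


def manual_mode(series):
    """Hypothetical helper: all values sharing the highest count, sorted."""
    counts = series.value_counts()             # frequency of each value
    winners = counts[counts == counts.max()]   # keep only the top frequency
    return sorted(winners.index)               # sort the surviving values


animal_modes = manual_mode(df["animal"])
color_modes = manual_mode(df["color"])

# For comparison: the values Pandas itself reports for the same column
pandas_animal_modes = df["animal"].mode().tolist()
```

For the `animal` column the hand-made version and `df["animal"].mode()` should agree on the same two values, while `color` yields a single winner.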


To get candidates for the most frequent values, we can take the first row.

In [5]:
df_mode.iloc[0]

animal    cat
id         a1
color     red
Name: 0, dtype: object

We should be aware, though, that other values can occur at the same maximum frequency. Here, we count the number of features that have more than one value at the maximum frequency.

In [6]:
df_mode.iloc[1].notna().sum()

2
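
If we also want to know how many values are tied in each column, one option is to count the non-missing entries per column of the mode data frame. This is a sketch under the assumption that the data itself contains no missing values. The names `n_modes` and `n_tied_features` are chosen just for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "cat", "dog", "cat", "horse"],
    "id": ["a7", "a5", "a3", "a4", "a1", "a6", "a2"],
    "color": ["red", "red", "red", "red", "red", "red", "blue"],
})

# count() ignores the missing values used as padding,
# so it gives the number of modes found in each column
n_modes = df.mode().count()

# Number of features with more than one value at the maximum frequency
n_tied_features = int((n_modes > 1).sum())
```

Unlike `df_mode.iloc[1]`, this also works when every column has exactly one mode and the mode data frame has only a single row.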

A so-called list comprehension is an elegant way to identify the columns that have two or more values occurring at the same maximum frequency.

In [7]:
v_col = [
    
    df_mode.columns[i]
    
    for i in range(len(df_mode.columns))
    
    if df_mode.notna().iloc[1,i]
]

v_col

['animal', 'id']
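
The same column names can also be obtained without a loop, by dropping the missing entries from the second row of the mode data frame. This is an alternative sketch, not a replacement for the list comprehension. The name `v_col_alt` and the length check are additions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "cat", "dog", "cat", "horse"],
    "id": ["a7", "a5", "a3", "a4", "a1", "a6", "a2"],
    "color": ["red", "red", "red", "red", "red", "red", "blue"],
})

df_mode = df.mode()

# The second row holds a value only for columns with at least two modes.
# If there is no second row, no column has a tie.
if len(df_mode) > 1:
    v_col_alt = df_mode.iloc[1].dropna().index.tolist()
else:
    v_col_alt = []
```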

If we also want to know which values in these columns occur at the same maximum frequency, we can combine a list comprehension with the `value_counts()` function. Normally, list comprehensions are written on a single line. Below I put the elements on separate lines to make the code easier to digest; this makes no difference when running it. Note that you cannot put a line break just anywhere. Between opening and closing brackets, such as `(` and `)`, you can, as shown below.

In [8]:
[
    pd.DataFrame(
        
        df[c_col].value_counts()

    ).sort_values(
        
        by = c_col, ascending=False
    )
    
    for c_col in v_col
]

[       animal
 dog         3
 cat         3
 horse       1,
     id
 a7   1
 a5   1
 a3   1
 a4   1
 a1   1
 a6   1
 a2   1]
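
The output above still lists horse, which is not tied for first place. If we only want the tied values, we can read them straight from the mode data frame. Since Pandas 2.0, `value_counts()` names its result `count` instead of using the column name, so sorting by `c_col` may behave differently there. The sketch below therefore gives the counts a fixed column name. The names `tied_values`, `count_frames` and the column label `"n"` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "cat", "dog", "cat", "horse"],
    "id": ["a7", "a5", "a3", "a4", "a1", "a6", "a2"],
    "color": ["red", "red", "red", "red", "red", "red", "blue"],
})

df_mode = df.mode()
v_col = ["animal", "id"]  # columns flagged as having tied modes

# Tied values per column, taken directly from the mode data frame
tied_values = {
    c_col: df_mode[c_col].dropna().tolist()
    for c_col in v_col
}

# Full counts per column; value_counts() already sorts by descending count.
# Renaming to "n" (hypothetical label) avoids relying on the default name.
count_frames = [
    df[c_col].value_counts().rename("n").to_frame()
    for c_col in v_col
]
```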