In [1]:
import pandas as pd
import seaborn as sns

# Pandas and masks

I'm a little concerned I didn't explain masks and how Pandas filters columns that well the other day. So I've made a notebook to try and explain them a little better.

## A small history and context of Python
[Python](https://en.wikipedia.org/wiki/Python_(programming_language)) is a general purpose programming language conceived in the late 1980s by Guido van Rossum. It has [20 guiding principles](https://peps.python.org/pep-0020/), of which only 19 are actually written down. Python is an [interpreted language](https://en.wikipedia.org/wiki/Interpreter_(computing)), meaning an *interpreter* translates and executes the source code at runtime (when the program is being run). This contrasts with a [compiled language](https://en.wikipedia.org/wiki/Compiler), where the translation to machine code happens statically at compile time, prior to the code being run.

Examples of interpreted languages include Python and JavaScript. Examples of compiled languages include C, C++, Rust and OCaml. There are many languages of each type, too many to list in this notebook.

Of probable use, but unrelated to this notebook, is [PEP8](https://peps.python.org/pep-0008/), a style guide to Python code. Following this should help increase the readability of your code.

## A brief introduction to Types
Most programming languages have the concept of a *type*. More formally, a *type* is a set of rules dictating what an expression can and cannot do. Less formally, a *type* is some information about expressions that another program can use to judge whether a program makes sense. We will expand on these notions presently.

Informally, we can agree that members of the following sets are different:

$$
\begin{align}
\mathbb{R}\\
\mathbb{R}\times\mathbb{R}
\end{align}
$$

The operation $(2, 3) + 7$ makes no sense, because one operand is a pair and the other is a single number. Similarly, $x + (y, z)$ has no obvious definition. A definition of $+$ could be constructed for these cases, but none is the natural choice.

Similarly, consider the following (Python) function definition

In [2]:
def add(x, y):
    return x + y

$x + y$ makes sense if both $x$ and $y$ are numbers, but if $x$ is a number and $y$ is a string, then this makes little sense. A language could do an *implicit* conversion, where it converts one operand silently, but this might not be desirable.

Some languages include extra type annotations to allow for these judgements to be made before the program is run. These may look something like this:

In [3]:
def add(x: int, y: int) -> int:
    return x + y

In Python, this is legal syntax, but the annotations aren't checked by the interpreter at all. In a different language, they can be subject to very powerful checks that ensure your program 'makes sense' before any code is run. For example, if the checks were to come across

In [4]:
add(5, "hello")

TypeError: unsupported operand type(s) for +: 'int' and 'str'

then they could flag up that this didn't make sense and alert the programmer. Conversely, the checks wouldn't flag

In [None]:
add(5, 7)

because this 'makes sense.'
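To make the point about unchecked annotations concrete, here is a minimal sketch (the variable names `annotations`, `joined` and `failed` are just for illustration). It shows that Python stores the annotations on the function but does nothing with them when the function is called.

```python
def add(x: int, y: int) -> int:
    return x + y

# The annotations are recorded on the function object...
annotations = add.__annotations__

# ...but nothing enforces them: two strings are happily "added",
# even though the annotations say int.
joined = add("hello, ", "world")

# A genuine mismatch is only discovered when the line actually runs.
try:
    add(5, "hello")
    failed = False
except TypeError:
    failed = True

print(annotations)
print(joined)
print(failed)
```

An external checker such as mypy can read these annotations and flag `add(5, "hello")` before the program is run, but that is a separate tool rather than part of the interpreter.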

Types are a very powerful aid to the programmer, and I think any piece of software worth its salt uses a language with a strong type checker. However, I don't make the budgeting decisions around here, so my opinion means very little.

## Python and Types
Python uses what is called [duck typing](https://en.wikipedia.org/wiki/Python_(programming_language)#Typing). You've definitely heard me say 'if it looks like a duck, and quacks like a duck, then it probably is a duck.' In practice, this means if an object has all methods required by a type, then it can act as a member of that type.
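As a rough sketch of duck typing (the classes `Duck` and `RobotDuck` and the function `make_it_quack` are made up for this example), the function below never checks what type it was given. It only cares that the object has a `quack` method.

```python
class Duck:
    def quack(self) -> str:
        return "Quack!"


class RobotDuck:
    # Not related to Duck in any way, but it has the same method.
    def quack(self) -> str:
        return "Beep-quack!"


def make_it_quack(thing) -> str:
    # No isinstance check: anything with a .quack() method is accepted.
    return thing.quack()


sounds = [make_it_quack(bird) for bird in (Duck(), RobotDuck())]
print(sounds)

# An int has no .quack() method, so it is not a duck.
try:
    make_it_quack(42)
    rejected = False
except AttributeError:
    rejected = True
print(rejected)
```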

Python is dynamically typed. This means there are no compile time checks, which means the first time you could know about a type error is when your code crashes in production from an unexpected input. How frustrating. This also means the type of things can change in hard-to-predict ways during program execution. This can be an issue, but is often fine if you remain disciplined with your programs, which is something Python's syntax and style guides can help you achieve.

Contrary to all I've said previously, Python *is* strongly typed. This means if it comes across an operation that doesn't make sense (like trying to add a number to a word), it won't silently try to make sense of the expression. It will just crash with a `TypeError`. This is *extremely* frustrating when dealing with large programs, which could often use a computer sense-checking everything beforehand. By and large, strong typing is probably the better choice to make, not least because of the insanity that lies down the road of weak typing. See JavaScript, where weak equality is *not* transitive (i.e. a == b and b == c does not necessarily mean that a == c) and expressions can sometimes behave differently depending on the type of the first variable. I'm getting worked up just typing (haha) about it.
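A small sketch of what strong typing looks like in practice (the variable names are just for illustration): Python refuses to guess what adding a number to a string should mean, and you have to convert explicitly in whichever direction you intended.

```python
# Python will not silently convert either operand.
try:
    result = 1 + "1"
except TypeError as err:
    result = None
    message = str(err)

# You have to say which conversion you meant.
explicit_sum = 1 + int("1")    # treat both as numbers
explicit_text = str(1) + "1"   # treat both as text

print(message)
print(explicit_sum, explicit_text)
```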

However, all this duck typing guff without static checks does mean that, in well-written libraries, Python *just works*. This is huge, and I can't stress enough how nice a feature of the language this is. I've never had such a welcoming and easy-to-pick-up experience as Python, and the type system probably plays a huge, understated part in that.

Some key types in Python include **bool**, **int**, **float**, **str** and **tuple**.
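A quick way to see these types for yourself is the built-in `type` function. This is only a sketch, and the example values are made up for illustration.

```python
# One example value for each of the key types.
examples = [True, 3, 3.0, "three", (3, "three")]

# type(value) returns the type object; __name__ is its short name.
names = [type(value).__name__ for value in examples]
print(names)  # ['bool', 'int', 'float', 'str', 'tuple']
```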

## Pandas DataFrames
A Pandas DataFrame is a data structure that holds tabular data. Often `read_csv` or `read_sql` is used to read data into the DataFrame. A crucial difference between a DataFrame and the `csv` file is that the DataFrame is held in RAM, whereas the `csv` is held in secondary storage (e.g. on the hard drive). Holding the data in RAM means it's much quicker to perform operations on, but RAM loses its data once the power is turned off, whilst secondary storage does not.

A Pandas DataFrame has column names and an index. By default, the index is numerical and starts at 0.
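As a minimal sketch of column names and the index (the DataFrame `df` and its contents are invented for this example), we can build a tiny DataFrame by hand and look at both.

```python
import pandas as pd

# A small hand-made DataFrame; the data is made up for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [10.0, 20.0, 30.0]})

columns = list(df.columns)  # the column names
index = list(df.index)      # the default index: 0, 1, 2

# The index does not have to be numerical: any column can be used instead.
by_name = df.set_index("name")
named_index = list(by_name.index)

print(columns, index, named_index)
```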

Let's load an example dataset. This one is called Iris, and it was a dataset used for training and testing early machine learning algorithms.

In [5]:
iris = sns.load_dataset('iris')
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


We can generate summary statistics by calling `describe`:

In [7]:
iris.describe()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


However, we're focusing on a particular aspect of Pandas dataframes here.

### Pandas wearing masks
A *mask* in Pandas is a boolean Series used to filter a DataFrame down to specific entries. A mask can either be applied across rows, or across columns. Examples of selecting from a DataFrame with square brackets include

In [8]:
iris['petal_width']

0      0.2
1      0.2
2      0.2
3      0.2
4      0.2
      ... 
145    2.3
146    1.9
147    2.0
148    2.3
149    1.8
Name: petal_width, Length: 150, dtype: float64

where we only consider the petal width, and

In [9]:
iris[iris['sepal_length'] == 5]

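|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 25 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
| 26 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
| 35 | 5.0 | 3.2 | 1.2 | 0.2 | setosa |
| 40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 43 | 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 49 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 60 | 5.0 | 2.0 | 3.5 | 1.0 | versicolor |
| 93 | 5.0 | 2.3 | 3.3 | 1.0 | versicolor |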


where we only consider sepal lengths equal to 5.

More complex filters can be applied (such as >, < or any arbitrary boolean function), but we want to explain the *semantics* (what's really going on) to help explain Pandas' behaviour and help you reason a little better about how to use Pandas.
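For reference, here is a rough sketch of what those more complex filters can look like. The DataFrame `flowers` is a tiny hand-made stand-in for Iris (the values are invented), so the cell runs without downloading anything.

```python
import pandas as pd

# A tiny stand-in for the Iris data; the values are made up.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
})

# A simple comparison.
long_sepals = flowers[flowers["sepal_length"] > 5.5]

# Two conditions combined with & (and). Use | for or, and ~ for not.
# Each condition needs its own parentheses.
long_virginica = flowers[
    (flowers["sepal_length"] > 6.0) & (flowers["species"] == "virginica")
]

# An arbitrary boolean function, applied to every value in a column.
starts_with_s = flowers[flowers["species"].apply(lambda name: name.startswith("s"))]

print(long_sepals.index.tolist())     # [2, 3, 4]
print(long_virginica.index.tolist())  # [2, 4]
print(starts_with_s.index.tolist())   # [0, 1]
```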

### So, what is a mask anyways?
As I said before, a mask is a boolean series. What does this mean though?

Let's take the second code example. The inner section (`iris['sepal_length'] == 5`) was the mask. This becomes a bit clearer when we run it on its own:

In [10]:
iris['sepal_length'] == 5

0      False
1      False
2      False
3      False
4       True
       ...  
145    False
146    False
147    False
148    False
149    False
Name: sepal_length, Length: 150, dtype: bool

We can see we've produced a boolean series. When we apply this to a Pandas dataframe in square brackets, the dataframe compares the indexes and selects only the rows where the corresponding index in the mask is true. For example, we can see row 4 in the mask is true, and row 4 was returned in our initial selection.
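The same mechanics can be sketched on a tiny hand-made DataFrame (`flowers` and its values are invented for this example), where it is easy to check every row by eye.

```python
import pandas as pd

# Four rows, so the whole mask fits on screen.
flowers = pd.DataFrame({"sepal_length": [5.1, 5.0, 4.7, 5.0]})

# The comparison produces one True/False per row, with the same index.
mask = flowers["sepal_length"] == 5
print(mask.tolist())  # [False, True, False, True]

# Applying the mask keeps only the rows where it is True.
selected = flowers[mask]
print(selected.index.tolist())  # [1, 3]

# .loc with the same mask selects the same rows.
same = flowers.loc[mask]

# There is nothing special about the comparison itself:
# a hand-written list of booleans of the right length works too.
by_hand = flowers[[False, True, False, True]]

# Summing a mask counts the True values, i.e. the matching rows.
n_matches = int(mask.sum())
print(n_matches)  # 2
```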

## Masks with a variable
However, this doesn't explain the behaviour of `iris['petal_width']`. If we run the part inside the square brackets on its own, it isn't a boolean series at all. It's just a string.

In [11]:
'petal_width'

'petal_width'

What's going on???

From the [docs](https://pandas.pydata.org/docs/user_guide/dsintro.html#column-selection-addition-deletion), you can treat a DataFrame as a 'dict of like-indexed Series objects'. 

This is a lot of terminology to use at once. Let's break it down a bit:

- A dict is a Python data structure. It associates *keys* with *values*, and can be thought of as a mapping. In Python, to retrieve value `v` associated with key `k` from dict `d`, we do `d[k]`. This implies `d[k] == v` always.
- Like-indexed means that the indexes (the numbers on the left identifying a row) are the same for each column. Since we're going to treat the columns as separate series (spoilers for the next bullet point), it is helpful to bear in mind that the index between columns remains consistent.
- Series objects have been mentioned already, but they can be thought of as a one-dimensional sequence of data where each value is labelled by an index.
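The three bullet points above can be sketched together as follows. The names `lengths`, `widths`, `plain_dict` and `frame` are made up for this example; the point is that a DataFrame built from a dict of Series supports the same `[key]` lookup as the dict itself.

```python
import pandas as pd

# Two Series with the same (default) index: 0, 1, 2.
lengths = pd.Series([5.1, 4.9, 4.7])
widths = pd.Series([3.5, 3.0, 3.2])

# An ordinary dict of Series, and a DataFrame built from it.
plain_dict = {"sepal_length": lengths, "sepal_width": widths}
frame = pd.DataFrame(plain_dict)

# The lookup syntax is the same for both.
from_dict = plain_dict["sepal_length"]
from_frame = frame["sepal_length"]

# Other dict-style operations work on the DataFrame too.
keys = list(frame.keys())
has_column = "sepal_length" in frame

# "Like-indexed": every column shares the same index.
shared_index = frame["sepal_length"].index.equals(frame["sepal_width"].index)

print(type(from_frame).__name__, keys, has_column, shared_index)
```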


As such, as with dicts, we can retrieve specific columns through passing column names as keys into the DataFrame. For example

In [12]:
iris['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64
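Putting the two behaviours side by side, as far as I understand them: what you get back from the square brackets depends on what you put in. This is only a sketch on a hand-made DataFrame (`flowers` and its values are invented for the example).

```python
import pandas as pd

flowers = pd.DataFrame({
    "sepal_length": [5.1, 5.0, 4.7],
    "species": ["setosa", "setosa", "versicolor"],
})

# A string key behaves like a dict lookup and returns one column as a Series.
column = flowers["sepal_length"]

# A boolean Series key behaves like a mask and returns the matching rows.
rows = flowers[flowers["sepal_length"] == 5]

# A list of strings returns a DataFrame containing just those columns.
subset = flowers[["species"]]

print(type(column).__name__)  # Series
print(type(rows).__name__)    # DataFrame
print(type(subset).__name__)  # DataFrame
```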

## Concluding remarks
There are none. Please send feedback to michael.hallam3@gmail.com

(c) Michael Hallam 2022