# Creating Masks
This notebook is part of the Lisbon Data Science meetup, class 2.

In [1]:
import pandas as pd
import numpy as np

# What are masks and why are they useful?
You have almost certainly used masks already. They are boolean arrays that let us access, in our case, parts of a DataFrame. These *parts* of the dataframe can be defined using inequalities, for instance. The best way to see this is to go through some examples.

In [22]:
# Example dataframe
df = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                   'Height': [165, 189, 359, 149, 175, 163]})
df

| | Age | Height |
|---|---|---|
| 0 | 20 | 165 |
| 1 | 18 | 189 |
| 2 | 25 | 359 |
| 3 | 55 | 149 |
| 4 | 125 | 175 |
| 5 | 30 | 163 |


Masks are useful for getting the parts of our dataframe that have specific characteristics. For instance, if we want people with age **0 or above** and **below 115**:

In [14]:
my_mask = (df['Age'] >= 0) & (df['Age'] < 115)
my_mask

0     True
1     True
2     True
3     True
4    False
5     True
Name: Age, dtype: bool

**This is our mask!** When you compare a DataFrame column against a value, you get back a boolean Series that is `True` for the **rows that fulfill your inequalities** and `False` for the rest. Let us see our mask in practice, where one of the rows (possibly an outlier) is left out:

In [15]:
df[my_mask]

| | Age | Height |
|---|---|---|
| 0 | 20 | 165 |
| 1 | 18 | 189 |
| 2 | 25 | 359 |
| 3 | 55 | 149 |
| 5 | 30 | 163 |
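Several conditions can be combined into one mask. The following is a minimal sketch of that idea, not part of the original class material: the height thresholds (50 and 250) and the names `plausible_height`, `both_ok` and `suspicious` are assumptions chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                   'Height': [165, 189, 359, 149, 175, 163]})

# Each comparison needs its own parentheses, because & and | bind
# more tightly than >= and < in Python.
plausible_age = (df['Age'] >= 0) & (df['Age'] < 115)

# Hypothetical thresholds, chosen only for this example.
plausible_height = (df['Height'] > 50) & (df['Height'] < 250)

# & means "both conditions hold", | means "at least one holds".
both_ok = plausible_age & plausible_height
suspicious = (df['Age'] >= 115) | (df['Height'] >= 250)

print(df[both_ok])
print(df[suspicious])
```

Note that the Python keywords `and` and `or` do not work element by element on a Series, which is why the `&` and `|` operators are used here.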


# Using masks to assign values to dataframe

Our initial dataframe is still...

In [16]:
df

| | Age | Height |
|---|---|---|
| 0 | 20 | 165 |
| 1 | 18 | 189 |
| 2 | 25 | 359 |
| 3 | 55 | 149 |
| 4 | 125 | 175 |
| 5 | 30 | 163 |


Since we didn't assign anything to it yet! We just took a look at filtered results of the dataframe. Let us drop the row with `Age=125`:

In [17]:
df = df[my_mask]
df

| | Age | Height |
|---|---|---|
| 0 | 20 | 165 |
| 1 | 18 | 189 |
| 2 | 25 | 359 |
| 3 | 55 | 149 |
| 5 | 30 | 163 |
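Notice that the filtered dataframe keeps the original row labels, so label 4 is now missing. The sketch below shows one possible way to handle this, under the assumption that you want an independent dataframe with consecutive labels; the names `filtered` and `renumbered` are made up for the example, and neither `.copy()` nor `reset_index` is required by the steps above.

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                   'Height': [165, 189, 359, 149, 175, 163]})
my_mask = (df['Age'] >= 0) & (df['Age'] < 115)

# .copy() makes the filtered result an independent dataframe,
# so later assignments to it are unambiguous.
filtered = df[my_mask].copy()

# The original labels are kept: 4 is missing.
print(filtered.index.tolist())

# Optionally renumber the rows from 0; drop=True discards the old labels.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())
```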


But we still have a person that looks too tall to be true. Let's do something about it: let's **trim her** to 155!

In [32]:
mask = df['Height'] == 359
df[mask]['Height'] = 155
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


| | Age | Height |
|---|---|---|
| 0 | 20 | 165 |
| 1 | 18 | 189 |
| 2 | 25 | 359 |
| 3 | 55 | 149 |
| 4 | 125 | 175 |
| 5 | 30 | 163 |


#### Oh no! 
We got a warning! **Maybe we shouldn't have trimmed that person down!!**

Actually, it's not that... The problem is that we are (*or might be*) assigning a value (`155`) to a *copy* of part of the dataframe instead of the actual dataframe! Notice that the height in row 2 is still 359 in the output above. This can be a hidden problem if we disregard the warning. Explaining it fully would require more time than we actually have, but I recommend you take a look at the [warning's link](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy). Always pay attention to the warnings - if you don't know what they mean, Google them.

The solution is to use [**`.loc[]`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html), which is primarily label based (e.g., using 'Age', 'Height'), but may also be used with a boolean array (which is what we want).

In [34]:
mask = df['Height'] == 359
df.loc[mask, 'Height'] = 155
df

| | Age | Height |
|---|---|---|
| 0 | 20 | 165 |
| 1 | 18 | 189 |
| 2 | 25 | 155 |
| 3 | 55 | 149 |
| 4 | 125 | 175 |
| 5 | 30 | 163 |
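The difference between the two assignments can be checked side by side. This is a small sketch on a fresh copy of the example dataframe; the variable names `after_chained` and `after_loc` are made up for the example, and the warning is silenced only to keep the output short.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                   'Height': [165, 189, 359, 149, 175, 163]})
mask = df['Height'] == 359

# Chained indexing: df[mask] builds a temporary object first, and the
# assignment lands on that temporary object rather than on df.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silenced only for this demonstration
    df[mask]['Height'] = 155
after_chained = df.loc[2, 'Height']
print(after_chained)  # df itself was not modified

# A single .loc call selects rows and column in one step,
# so the assignment is applied to df itself.
df.loc[mask, 'Height'] = 155
after_loc = df.loc[2, 'Height']
print(after_loc)
```

The general pattern is `df.loc[row_mask, column_label] = value`: rows and column go inside the same pair of brackets.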


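And here we have our dataframe without extreme heights. Note that the `Age=125` row is back in the dataframe shown after the `.loc` assignment: judging by the cell numbers, the example dataframe was re-created before that cell ran, so `my_mask` needs to be applied again to keep the ages within the specified range. By the way, if you want to invert your mask in a pythonic way you just need to do this: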

In [35]:
~my_mask

0    False
1    False
2    False
3    False
4     True
5    False
Name: Age, dtype: bool
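An inverted mask is handy for inspecting the rows that a filter would throw away. A minimal sketch, assuming the original six-row example dataframe (the name `outliers` is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                   'Height': [165, 189, 359, 149, 175, 163]})
my_mask = (df['Age'] >= 0) & (df['Age'] < 115)

# ~ flips every True/False in the mask, so this selects the rows
# that do NOT satisfy the age condition.
outliers = df[~my_mask]
print(outliers)

# Python's `not` keyword does not work element by element on a Series:
# it raises a ValueError because the truth value is ambiguous.
try:
    not my_mask
except ValueError as err:
    print('not my_mask failed:', err)
```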