# Data Filtering

In [None]:
import pandas as pd

# Load Dataset

In [None]:
ttn = pd.read_csv('../datasets/titanic.csv')
ttn

# Handling missing values
In pandas, missing values may appear under several names:
* __NaN__ - __Nan__ - __nan__
* __Null__ - __null__
* __NA__ - __na__

Conceptually they all mean a __missing value__.

## Identify missing values

__Pandas Methods:__
* `isnull()`: Returns `True` for __missing values__ (`False` otherwise).
* `notnull()`: __The inverse of `isnull()`__. Returns `True` for __non-missing values__ (`False` otherwise).

__Note:__ `isna()` and `notna()` are aliases of `isnull()` and `notnull()` respectively, and behave identically.
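A minimal sketch of the two methods on a small made-up DataFrame. The `toy` data and the variable names are assumptions for illustration only; they are not taken from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), with one missing value per column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": ["C85", "E46", None],
})

missing_mask = toy.isnull()    # True where a value is missing
present_mask = toy.notnull()   # True where a value is present

print(missing_mask)
print(present_mask)
```

Both calls return a DataFrame of booleans with the same shape as `toy`, which is what makes them usable for counting and filtering.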
<br>

__Examples:__ Testing `isnull()` and `notnull()`

In [None]:
# Let's test isnull()



In [None]:
# Let's test notnull()



## Why/How are these methods useful?

### 1. Find the number of missing values in each column

__Question:__ Which columns have missing values, and how many?

In [None]:
# We will use the `sum()` method to find out



#### Why did this work?
1. Boolean values (`False` and `True`) are essentially __0__ and __1__ under the hood.
2. __Remember:__ `sum()` works along the vertical axis by default (`axis=0`), so it produces one total per column.

So, summing counts all the `True` occurrences in each column (in our case the __missing values__, since we used `isnull()`).
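The reasoning above can be sketched on a small made-up DataFrame. The `toy` data and `missing_per_column` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Embarked": ["S", "C", None, "Q"],
    "Pclass": [3, 1, 2, 3],
})

# True counts as 1 and False as 0, so summing each column
# of the boolean mask counts that column's missing values
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only the columns that have at least one missing value
print(missing_per_column[missing_per_column > 0])
```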

### 2. Filter the rows with a missing value in a particular field

__Question:__ For which passengers do we not know where they __embarked__ from?
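One way to approach this kind of question, sketched on made-up data: build a boolean mask from one column and use it to select whole rows. The `toy` DataFrame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Embarked": ["S", None, "C", None],
})

# Step 1: boolean mask, True where 'Embarked' is missing
embarked_missing = toy["Embarked"].isnull()

# Step 2: index the DataFrame with the mask to keep all columns of those rows
unknown_port = toy[embarked_missing]
print(unknown_port)
```

`toy.loc[embarked_missing]` selects the same rows and is a common alternative spelling.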

In [None]:
# Step 1: We want the rows where 'Embarked' is null
# Step 2: We want all the columns for these rows



## What should I do about missing values?

There is no single answer to that.

__It depends on the dataset, the analysis we try to do, the problem we try to solve, etc...__

### But we have some options

### 1. Drop missing values

__Pandas Method:__ `dropna()`

Allows us to drop rows based on their missing values.

<br>

__Use cases:__
* Drop a row if __any__ of the fields is missing.
* Drop a row if __all__ of the fields are missing.
* Drop a row if any (or all) of the fields are missing within a __subset__ of columns.
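The three use cases can be sketched on a small made-up DataFrame. The `toy` data is an assumption for illustration; the `how` and `subset` parameters are the ones named in the exercises of this section.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration):
# row 0 is complete, row 2 is entirely missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Embarked": ["S", "C", None, "Q"],
    "Cabin": ["C85", None, None, None],
})

# Drop a row if ANY field is missing: only row 0 survives
any_dropped = toy.dropna(how="any")

# Drop a row only if ALL fields are missing: row 2 is removed
all_dropped = toy.dropna(how="all")

# Only look at 'Age' and 'Embarked' when deciding: rows 0 and 3 survive
subset_dropped = toy.dropna(how="any", subset=["Age", "Embarked"])

print(toy.shape, any_dropped.shape, all_dropped.shape, subset_dropped.shape)
```

`dropna()` returns a new DataFrame and leaves `toy` unchanged, which is why comparing shapes before and after is a useful check.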

__Example 1:__ Drop the rows in which __any__ field is missing:

In [None]:
# Use dropna() to remove rows with how='any'


# Print the shape of the original dataset



<br>

__Example 2:__ Drop the rows in which __all__ fields are missing.

In [None]:
# Use dropna() to remove rows with how='all'



<br>

__Example 3:__ Drop the rows in which __any__ value is missing within a __subset__ of columns.

In [None]:
# Use dropna() to remove rows with how='any' and subset=['Age', 'Embarked']



### 2. Fill in missing values

__Pandas Method:__ `fillna()`

Fills the missing values with a given value.
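A minimal sketch on made-up data, covering both a fixed fill value and a computed one (the column mean). The `toy` DataFrame and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None],
    "Age": [20.0, np.nan, 40.0, 30.0],
})

# dropna=False makes value_counts() include the missing values in the tally
print(toy["Cabin"].value_counts(dropna=False))

# Fill with a fixed placeholder value
cabin_filled = toy["Cabin"].fillna("NO_CABIN")

# Fill with a computed value; mean() skips missing values by default
mean_age = toy["Age"].mean()
age_filled = toy["Age"].fillna(mean_age)

print(cabin_filled.tolist())
print(age_filled.tolist())
```

Like `dropna()`, `fillna()` returns a new object, so the result has to be assigned back (for example `toy["Age"] = age_filled`) to keep it.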

__Example:__ Fill missing __Cabin__ values, with the value `NO_CABIN`

In [None]:
# Print the value counts of the Cabin - including missing


# Use fillna() to fill Cabin null values with value='NO_CABIN'



In [None]:
# Print the value counts of the Age - including missing


# Do the same with Age, but fill in with the average age



# Data Sorting
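As a reference for sorting, here is a minimal sketch of `sort_values()` on made-up data. The `toy` DataFrame is an assumption for illustration; `by` and `ascending` are existing parameters of `sort_values()`.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 4.0, 54.0],
})

# Single column, ascending (the default) and descending
by_age_asc = toy.sort_values(by="Age")
by_age_desc = toy.sort_values(by="Age", ascending=False)

# Several columns: 'Pclass' ascending, then 'Age' descending within each class
by_class_then_age = toy.sort_values(by=["Pclass", "Age"], ascending=[True, False])

print(by_class_then_age)
```

The original index labels travel with the rows, so the index of a sorted result shows where each row came from.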

In [None]:
# Sort values by column 'Age' - ascending/descending



In [None]:
# Sort values by more than one column - ascending & descending



# Filtering
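A minimal sketch of combining conditions and adjusting the index, on made-up data. The `toy` DataFrame, including the `Sex` column and its `"female"`/`"male"` values, is an assumption for illustration.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Sex": ["female", "male", "female", "male", "female"],
    "Age": [58.0, 71.0, 0.75, 30.0, 45.0],
})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
older_women = toy[(toy["Sex"] == "female") & (toy["Age"] > 50)]
extremes = toy[(toy["Age"] > 70) | (toy["Age"] < 1)].copy()

# Replace the leftover index labels with a fresh 0..n-1 index
extremes = extremes.reset_index(drop=True)

# Use an existing column as the index instead
by_id = toy.set_index("PassengerId")

print(older_women)
print(extremes)
print(by_id)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.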

In [None]:
# Get the female passengers who are older than 50




In [None]:
# Get the passengers that are older than 70 years or younger than 1
# and save the result in a new DataFrame




In [None]:
# Reset index (drop old index) of new dataframe



In [None]:
# Set PassengerId as Index



---
# Aggregate Statistics (Groupby)

In [None]:
# Group-by 'Pclass', display mean of numeric only values
ttn.groupby(['Pclass']).mean(numeric_only=True)