```
From: https://github.com/ksatola
Version: 0.0.1

```

# Cleaning Data
We can use generic tools like `pandas` and specialized tools like `pyjanitor` to help clean data.

The `clean_names` function in `pyjanitor` returns a DataFrame whose column names are lowercased, with spaces replaced by underscores.

In [1]:
# Make the project's local modules in ../src importable and reload them automatically on change
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')

In [2]:
from datasets import get_dataset

In [3]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
df = get_dataset('titanic3')

## Column Names

In [11]:
# Simulate inconsistent column names (mixed case, embedded space)
df.columns = ['Pclass', 'survived', 'name', 'SEX', 'age', 'sibsp', 'pARCH', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home dest']

In [12]:
import janitor as jn

# Columns in lowercase and spaces replaced by underscores
df = jn.clean_names(df)
df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest'],
      dtype='object')

In [16]:
# Simulate messy column names with stray whitespace and mixed case
df.columns = ['Pclass', 'survived ', 'name', ' SEX ', 'age', 'sibsp', 'pARCH', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', ' home dest']

def clean_col(name):
    """Strip surrounding whitespace, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

df = df.rename(columns=clean_col)
df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home_dest'],
      dtype='object')
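
The same cleanup can also be written without a helper function by using the vectorized string methods on the column index. The following is a minimal sketch on a small, made-up DataFrame (`toy` and its values are illustrative assumptions, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy frame with messy column labels
toy = pd.DataFrame(
    [[1, "female", "St Louis, MO"]],
    columns=["Pclass", " SEX ", " home dest"],
)

# Index.str applies each string method to every column label at once
toy.columns = (
    toy.columns
    .str.strip()                          # drop leading/trailing whitespace
    .str.lower()                          # normalize case
    .str.replace(" ", "_", regex=False)   # literal space -> underscore
)

cleaned_columns = list(toy.columns)
print(cleaned_columns)
```

Assigning to `toy.columns` modifies the frame in place, whereas `rename(columns=clean_col)` returns a new DataFrame, so the `rename` form fits better inside a method chain.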

## Replace Missing Values

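The `coalesce` function in `pyjanitor` takes a DataFrame and a list of columns to consider. For each row, it returns the first non-null value found among those columns, checked in the order listed. In the output below, the source columns (`body` and `age`) are dropped and the coalesced values are stored in a new column.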



In [23]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home_dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [22]:
# Per row, take `body` if it is present, otherwise `age`;
# the result is stored in a new column (here named "home.dest")
df1 = jn.coalesce(
    df,
    columns=["body", "age"],
    new_column_name="home.dest",
)
df1.head()

Unnamed: 0,pclass,survived,name,sex,sibsp,parch,ticket,fare,cabin,embarked,boat,home_dest,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,0,0,24160,211.3375,B5,S,2.0,"St Louis, MO",29.0
1,1,1,"Allison, Master. Hudson Trevor",male,1,2,113781,151.55,C22 C26,S,11.0,"Montreal, PQ / Chesterville, ON",0.9167
2,1,0,"Allison, Miss. Helen Loraine",female,1,2,113781,151.55,C22 C26,S,,"Montreal, PQ / Chesterville, ON",2.0
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,1,2,113781,151.55,C22 C26,S,,"Montreal, PQ / Chesterville, ON",135.0
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,1,2,113781,151.55,C22 C26,S,,"Montreal, PQ / Chesterville, ON",25.0
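
The row-wise "first non-null value" logic shown above can be approximated in plain pandas. This is a sketch under stated assumptions: `toy` is a made-up frame, and back-filling across columns is one possible way to get the behavior, not necessarily how `pyjanitor` implements it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: `body` is mostly missing, `age` is mostly present
toy = pd.DataFrame({
    "body": [np.nan, 135.0, np.nan],
    "age": [29.0, 30.0, np.nan],
})

# Back-fill along each row so the leftmost column holds the first
# non-null value in column order, then keep only that column
coalesced = toy[["body", "age"]].bfill(axis=1).iloc[:, 0]

print(coalesced.tolist())
```

Column order matters here: listing `age` first would prefer `age` over `body`. A row where every listed column is missing stays `NaN`.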


In [25]:
# Fill missing values in one column with a constant; the result is a Series, not a DataFrame
df1 = df['body'].fillna(10)
df1.head()

0     10.0
1     10.0
2     10.0
3    135.0
4     10.0
Name: body, dtype: float64

In [27]:
# pyjanitor variant: fill missing values in the listed columns and get the whole DataFrame back
df1 = jn.fill_empty(
    df,
    columns=["body"],
    value=10,
)
df1.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home_dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,10.0,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,10.0,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,10.0,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,10.0,"Montreal, PQ / Chesterville, ON"
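
When different columns need different fill values, `DataFrame.fillna` also accepts a dictionary that maps column names to values. The sketch below uses a made-up frame; the choice of the median for `age` and the `"unknown"` label for `cabin` are illustrative assumptions, not recommendations from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one text column
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "cabin": ["B5", None, "C22 C26"],
})

# Map each column to its own fill value; columns not listed are left untouched
filled = toy.fillna({
    "age": toy["age"].median(),   # median ignores NaN by default
    "cabin": "unknown",
})

print(filled)
```

`fillna` returns a new DataFrame here, so `toy` itself still contains the missing values.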


## Final Check for NaNs

In [28]:
# True if any cell in the DataFrame is missing: the first .any() reduces each column, the second reduces across columns
df.isna().any().any()

True
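
A single boolean says that something is missing but not where. As a small sketch on a made-up frame (`toy` is an illustrative assumption), the same `isna()` mask can be summed or averaged to locate the gaps per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two of three columns
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "body": [np.nan, np.nan, 135.0],
    "sex": ["female", "male", "female"],
})

# Number of missing cells per column
missing_counts = toy.isna().sum()

# Fraction of missing cells per column (mean of a boolean mask)
missing_share = toy.isna().mean()

# Names of the columns that still contain at least one NaN
columns_with_nans = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(columns_with_nans)
```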