### Cleaning data with pandas!

What we will learn in this notebook:

- How to drop empty rows and columns
- How to drop rows with null values
- How to replace null values
- How to find and replace placeholder values
- How to change column types

First, let's import pandas and open `../data/simple_data.csv` as a dataframe. Then, let's look at the data!

In [261]:
import pandas as pd

In [262]:
simple_data = pd.read_csv('../data/simple_data.csv')

In [263]:
simple_data

Unnamed: 0,participant_id,empty_column,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,,1.0,4.5,8,2
1,23423.0,,1.0,4.2,NO_DATA,5
2,,,,,,
3,43029.0,,0.0,3.7,4,ASK_LATER
4,30400.0,,1.0,,9,2
5,60495.0,,0.0,4.4,NO_DATA,2
6,12321.0,,1.0,3.2,3,ASK_LATER
7,23090.0,,1.0,2.1,NO_DATA,1
8,99230.0,,,3.2,2,4
9,23432.0,,0.0,,7,6


#### Checking for null values

Uh-oh. At first glance, this dataset looks pretty bad! We can check whether there's missing data by using the `.isna()` function. You can call it on the whole dataset, a single column, or a single row.

Let's check if there's missing data in the whole dataset using the `.isna()` function.

In [264]:
simple_data.isna()

Unnamed: 0,participant_id,empty_column,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,False,True,False,False,False,False
1,False,True,False,False,False,False
2,True,True,True,True,True,True
3,False,True,False,False,False,False
4,False,True,False,True,False,False
5,False,True,False,False,False,False
6,False,True,False,False,False,False
7,False,True,False,False,False,False
8,False,True,True,False,False,False
9,False,True,False,True,False,False


That helps us, kind of, but a table full of `True` and `False` is hard to scan. Let's chain the `.sum()` function after `.isna()` to count the missing values in each column.

In [265]:
simple_data.isna().sum()

participant_id           1
empty_column            13
missing_values           3
missing_values_2         3
placeholder_values       1
placeholder_values_2     1
dtype: int64

Let's check how much data is missing in the third row (position 2, since counting starts at 0) using the `.iloc[]` indexer we learned last week.

In [266]:
simple_data.iloc[2].isna().sum()

6
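
The three ways of checking described above can be sketched side by side. This is a minimal example on a small made-up dataframe called `toy` (an assumption for illustration, not the notebook's CSV).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's CSV
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

whole = toy.isna().sum()             # missing values per column
one_column = toy["a"].isna().sum()   # missing values in column "a"
one_row = toy.iloc[1].isna().sum()   # missing values in the second row (position 1)

print(whole)
print(one_column, one_row)
```

Because `True` counts as 1 when summed, `.sum()` turns the table of booleans into counts.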

This data is a mess! Let's clean it up.

#### Dropping empty columns

There's a column, aptly titled `empty_column`, that has no values at all! We can drop it using the `.dropna()` function.

In [267]:
simple_data = simple_data.dropna(axis='columns', how='all')

In [268]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
2,,,,,
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
8,99230.0,,3.2,2,4
9,23432.0,0.0,,7,6
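
The `how` argument decides how strict `.dropna()` is. The sketch below uses a made-up dataframe (`toy` and its column names are assumptions for illustration) to compare `how='all'`, which we used above, with `how='any'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete, one partly missing, one empty column
toy = pd.DataFrame({
    "full": [1, 2, 3],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

# how="all" drops a column only if every value in it is missing
only_all_missing = toy.dropna(axis="columns", how="all")

# how="any" drops a column if even one value is missing
any_missing = toy.dropna(axis="columns", how="any")

print(list(only_all_missing.columns))
print(list(any_missing.columns))
```

With `how='all'` the partly filled column survives, which is why it is the safer choice when we only want to remove truly empty columns.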


#### Dropping empty rows

It also looks like there's a row with no data at all! We can drop that using `.dropna()` too, but this time we can set our axis to `rows`.

In [269]:
simple_data = simple_data.dropna(axis='rows', how='all')

In [270]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
8,99230.0,,3.2,2,4
9,23432.0,0.0,,7,6
10,21233.0,1.0,2.1,NO_DATA,8
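
Notice that the index now skips from 1 to 3: dropping rows keeps the original index labels. Here is a minimal sketch of that behavior on a made-up dataframe (`toy` is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where the middle row is completely empty
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": ["a", None, "c"],
})

cleaned = toy.dropna(axis="rows", how="all")

# axis=0 is another spelling of axis="rows"
same = toy.dropna(axis=0, how="all")

print(list(cleaned.index))  # the surviving rows keep their original labels
```

This matters later on, because `.drop()` refers to rows by these labels and not by their position.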


#### Dropping rows with missing values

Sometimes data can be incomplete. We can drop rows with incomplete data by using `.dropna()` in a slightly different way from above. 

Let's remove rows that have `NaN`, or a null value, in the `missing_values` column.

In [271]:
simple_data = simple_data.dropna(subset=['missing_values'])

In [272]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
9,23432.0,0.0,,7,6
10,21233.0,1.0,2.1,NO_DATA,8
12,93904.0,1.0,2.8,NO_DATA,0
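
The `subset` argument limits the check to the columns we name. A small sketch, assuming a made-up dataframe with hypothetical column names `must_have` and `optional`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the notebook's CSV
toy = pd.DataFrame({
    "must_have": [1.0, np.nan, 0.0],
    "optional": [np.nan, 2.0, 3.0],
})

# Only rows missing "must_have" are dropped
kept = toy.dropna(subset=["must_have"])

print(kept)
```

The first row stays even though `optional` is missing there, just as rows with a missing `missing_values_2` stayed in our data above.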


#### Filling missing values

Sometimes we don't want to drop rows just because they're missing values. Sometimes we feel like we can adequately replace NaNs with an actual value.

Let's replace the `NaN` values in the `missing_values_2` column using `.fillna()`!

In [273]:
values = {'missing_values_2': 0.0}
simple_data = simple_data.fillna(value=values)

In [274]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,0.0,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
9,23432.0,0.0,0.0,7,6
10,21233.0,1.0,2.1,NO_DATA,8
12,93904.0,1.0,2.8,NO_DATA,0
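
Passing a dictionary to `.fillna()` fills only the columns named in it. The sketch below shows this on a made-up dataframe (`toy`, `score`, and `flag` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy = pd.DataFrame({
    "score": [4.5, np.nan, 3.7],
    "flag": [1.0, np.nan, 0.0],
})

# Only "score" is listed, so only "score" gets filled
filled = toy.fillna(value={"score": 0.0})

print(filled)
```

Whether 0.0 is a sensible stand-in depends on what the column means, so it is worth choosing the fill value column by column.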


#### Dropping individual rows

Sometimes people put things that shouldn't be in our data. What are a few unhelpful things in our data right now?

We can drop an individual row by passing its index label to `.drop()`.

In [280]:
simple_data = simple_data.drop(1)

In [281]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,0.0,9,2
6,12321.0,1.0,3.2,3,ASK_LATER
9,23432.0,0.0,0.0,7,6
10,21233.0,1.0,2.1,NO_DATA,8
12,93904.0,1.0,2.8,NO_DATA,0
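
Since we have already removed some rows, index labels and row positions no longer line up. This sketch, on a made-up dataframe with a gap in its index (an assumption for illustration), shows that `.drop()` works on labels.

```python
import pandas as pd

# Hypothetical toy data whose index skips 2, like our cleaned dataframe
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0, 1, 3, 4])

# Removes the row LABELED 3, which sits at position 2
without_label_3 = toy.drop(3)

# To drop by position instead, look up the label at that position first
without_position_3 = toy.drop(toy.index[3])

print(list(without_label_3.index))
print(list(without_position_3.index))
```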


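#### Dropping lists of rows

To drop several rows at once, pass a list of index labels to `.drop()`. (Going by the execution counts, this cell ran before the single-row drop above, which is why row 1 still appears in the output below.)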

In [278]:
simple_data = simple_data.drop([5, 7])

In [279]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,0.0,9,2
6,12321.0,1.0,3.2,3,ASK_LATER
9,23432.0,0.0,0.0,7,6
10,21233.0,1.0,2.1,NO_DATA,8
12,93904.0,1.0,2.8,NO_DATA,0
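
One thing to watch for when dropping a list of labels: if any label is not in the index, `.drop()` raises a `KeyError`. A minimal sketch on a made-up dataframe (an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data; there is no row labeled 2
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[0, 1, 3])

dropped = toy.drop([0, 3])

# Label 2 does not exist, so this raises a KeyError
try:
    toy.drop([1, 2])
    raised = False
except KeyError:
    raised = True

# errors="ignore" drops the labels that exist and skips the rest
lenient = toy.drop([1, 2], errors="ignore")

print(list(dropped.index), raised, list(lenient.index))
```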


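#### Dropping placeholder values by condition

Some cells hold a placeholder such as `NO_DATA` instead of a real value. We can build a condition that finds those rows, look up their index labels, and pass the labels to `.drop()`.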

In [282]:
condition = simple_data.placeholder_values == 'NO_DATA'
indices = simple_data[condition].index
simple_data = simple_data.drop(indices)

In [283]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,0.0,9,2
6,12321.0,1.0,3.2,3,ASK_LATER
9,23432.0,0.0,0.0,7,6
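
The same result can also be reached by keeping the rows that do not match the condition. The sketch below compares the two approaches on a made-up dataframe (`toy` and `answer` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with a placeholder string in some rows
toy = pd.DataFrame({"answer": ["8", "NO_DATA", "4", "NO_DATA"]})

condition = toy["answer"] == "NO_DATA"

# Approach from the cell above: find the matching labels, then drop them
via_drop = toy.drop(toy[condition].index)

# Alternative: keep every row where the condition is False (~ means "not")
via_mask = toy[~condition]

print(via_mask)
```

Both versions leave the rows labeled 0 and 2.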


#### Replacing placeholder values

Other times we want to replace placeholder values. To do this, we can use the `.replace()` function.

In [302]:
simple_data['placeholder_values_2'] = simple_data['placeholder_values_2'].replace('ASK_LATER', 0)

In [303]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202,1.0,4.5,8.0,2.0
3,43029,0.0,3.7,4.0,0.0
4,30400,1.0,0.0,9.0,2.0
6,12321,1.0,3.2,3.0,0.0
9,23432,0.0,0.0,7.0,6.0
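
If a column contains more than one kind of placeholder, `.replace()` also accepts a dictionary that maps each placeholder to its replacement. A minimal sketch on a made-up series (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy column mixing number-like strings and two placeholders
toy = pd.Series(["2", "ASK_LATER", "NO_DATA", "6"], dtype=object)

# Map each placeholder to the value that should take its place
replaced = toy.replace({"ASK_LATER": 0, "NO_DATA": 0})

print(replaced.tolist())
```

The untouched entries are still strings after this step, so the column type still needs fixing before we can do math on it.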


#### Fixing column types

Our data's looking much better! There's one last thing we need to do to make sure that it's ready for analysis. Let's check the types of each column. You can do this by appending `.dtypes` to your dataframe's variable name.

In [286]:
simple_data.dtypes

participant_id          float64
missing_values          float64
missing_values_2        float64
placeholder_values       object
placeholder_values_2     object
dtype: object

Notice there are two different data types being used in this dataframe: `float64` and `object`. Here, `object` means the column holds strings. Different types have different rules. These rules can help us create guardrails for ourselves.

For instance, we probably want to be able to do math on all the numbers in the `placeholder_values` and `placeholder_values_2` columns. So let's fix that! Use the `.astype()` function to convert `placeholder_values` and `placeholder_values_2` from `object` (strings) to `float`.

In [301]:
simple_data['placeholder_values'] = simple_data['placeholder_values'].astype(float)
simple_data['placeholder_values_2'] = simple_data['placeholder_values_2'].astype(float)

In [293]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8.0,2.0
3,43029.0,0.0,3.7,4.0,0.0
4,30400.0,1.0,0.0,9.0,2.0
6,12321.0,1.0,3.2,3.0,0.0
9,23432.0,0.0,0.0,7.0,6.0


Now check the types of each column again by using `.dtypes`. Notice a change?

In [292]:
simple_data.dtypes

participant_id          float64
missing_values          float64
missing_values_2        float64
placeholder_values      float64
placeholder_values_2    float64
dtype: object
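
The conversion above only works because we already removed or replaced every placeholder. The sketch below, on a made-up series (an assumption for illustration), shows `.astype(float)` failing when a placeholder is left over. It also shows `pd.to_numeric` with `errors='coerce'`, an alternative not used in this notebook, which turns anything unparseable into `NaN`.

```python
import pandas as pd

# Hypothetical toy column that still contains a placeholder
raw = pd.Series(["8", "NO_DATA", "4"], dtype=object)

# "NO_DATA" cannot be read as a number, so astype(float) raises a ValueError
try:
    raw.astype(float)
    failed = False
except ValueError:
    failed = True

# Alternative: convert what can be converted, mark the rest as NaN
coerced = pd.to_numeric(raw, errors="coerce")

print(failed)
print(coerced)
```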

One last thing! Right now, the column `participant_id` is a `float64`. We usually don't want or expect to do math on identification numbers, so let's convert that to a `str`.

In [298]:
simple_data['participant_id'] = simple_data['participant_id'].astype(int).astype(str)

In [299]:
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202,1.0,4.5,8.0,2.0
3,43029,0.0,3.7,4.0,0.0
4,30400,1.0,0.0,9.0,2.0
6,12321,1.0,3.2,3.0,0.0
9,23432,0.0,0.0,7.0,6.0


In [300]:
simple_data.dtypes

participant_id           object
missing_values          float64
missing_values_2        float64
placeholder_values      float64
placeholder_values_2    float64
dtype: object
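
Why convert to `int` before converting to `str`? A small sketch on made-up IDs (an assumption for illustration) shows the difference.

```python
import pandas as pd

# Hypothetical IDs stored as floats, like participant_id above
ids = pd.Series([11202.0, 43029.0])

# Converting directly keeps the trailing ".0" in the text
direct = ids.astype(str)

# Going through int first removes the decimal part
via_int = ids.astype(int).astype(str)

print(direct.tolist())
print(via_int.tolist())
```

Note that `.astype(int)` raises an error if the column still contains `NaN`, so this step relies on the empty rows having been dropped earlier.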