# **Managing Nulls with Pandas**

This notebook works through several ways to detect, fill, and drop null values in pandas DataFrames.

## Data Loading

In [None]:
import pandas as pd
from numpy import random

In [None]:
FILE_PATH = '/content/iot_data.csv'
df = pd.read_csv(FILE_PATH)
df

,timestamp,username,temperature,heartrate,build,latest,note
0,2017-01-01T12:00:23,michaelsmith,12.0,67.0,4e6a7805-8faa-2768-6ef6-eb3198b483ac,0.0,interval
1,2017-01-01T12:01:09,kharrison,6.0,78.0,7256b7b0-e502-f576-62ec-ed73533c9c84,0.0,wake
2,2017-01-01T12:01:34,smithadam,5.0,89.0,9226c94b-bb4b-a6c8-8e02-cb42b53e9c90,0.0,
3,2017-01-01T12:02:09,eddierodriguez,28.0,76.0,,0.0,update
4,2017-01-01T12:02:36,kenneth94,29.0,62.0,122f1c6a-403c-2221-6ed1-b5caa08f11e0,,
...,...,...,...,...,...,...,...
72018,2017-01-30T07:07:00,jfarmer,22.0,87.0,,0.0,
72019,2017-01-30T07:07:41,epalmer,,67.0,2721f0a2-182a-eda8-5cb8-312f4854b563,,
72020,2017-01-30T07:08:09,sandra28,22.0,64.0,,,interval
72021,2017-01-30T07:08:32,basslisa,,66.0,251c9a6a-5b8b-b401-4934-d1f02ab4fd62,1.0,test


## Exercises

### Exercise 1 - Detect all null values (including `n/a` entries in the note column) and fill the note column with a meaningful value

In [None]:
# Count the missing values in the note column
df['note'].isnull().sum()

23915
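
The count above relies on `pd.read_csv` having already turned `n/a` strings into `NaN`. The sketch below illustrates that default behaviour on a small in-memory sample. The sample rows and the extra `"missing"` marker are hypothetical and are not taken from `iot_data.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration
sample_csv = io.StringIO(
    "username,note\n"
    "alice,wake\n"
    "bob,n/a\n"
    "carol,missing\n"
)

# "n/a" is on pandas' default list of NA markers, so it is parsed as NaN.
# "missing" is not on that list, so it is passed explicitly through na_values.
sample = pd.read_csv(sample_csv, na_values=["missing"])

note_null_count = int(sample["note"].isnull().sum())
print(note_null_count)
```

If a file uses its own placeholder for missing data, listing it in `na_values` at load time is usually simpler than replacing it afterwards.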

In [None]:
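# Fill missing notes with a placeholder label ('unknown' is an assumed choice).
# The result is kept in a separate Series so df['note'] keeps its nulls for the later exercises.
note_filled = df['note'].fillna('unknown')
note_filled.isnull().sum()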

In [None]:
df.isnull().sum()

### Exercise 2 - List every column that contains nulls, with the count and percentage of nulls in each

In [None]:
df[df.columns[df.isnull().any()]].isnull().sum() * 100 / df.shape[0]

temperature    22.077670
heartrate       0.001388
build          22.224845
latest         21.962429
note           33.204671
dtype: float64
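
The exercise asks for both the count and the percentage. One way to report them side by side is sketched here on a hypothetical toy frame. The names `toy` and `null_summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the IoT data
toy = pd.DataFrame({
    "temperature": [12.0, np.nan, 5.0, np.nan],
    "heartrate": [67.0, 78.0, 89.0, 76.0],
    "note": ["interval", None, None, None],
})

null_counts = toy.isnull().sum()
# Keep only the columns that contain at least one null
null_counts = null_counts[null_counts > 0]

# Combine the raw count and the share of rows into one table
null_summary = pd.DataFrame({
    "null_count": null_counts,
    "null_percent": null_counts * 100 / len(toy),
})
print(null_summary)
```

The same three statements should work on `df` unchanged, since they only depend on `isnull()` and the row count.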

### Exercise 3 - Substitute the most frequent value (the mode) for missing data in the latest column

In [None]:
new_df = df.fillna({'latest': df['latest'].mode()[0]})
new_df['latest'].isnull().sum()

0
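
To make the role of `mode()[0]` clearer, here is a minimal sketch on a hypothetical Series. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical flag column with two gaps
latest = pd.Series([0.0, 0.0, 1.0, np.nan, 0.0, np.nan])

# mode() ignores NaN and returns a Series, because several values can tie
# for most frequent; [0] takes the first (smallest) of them
most_common = latest.mode()[0]

latest_filled = latest.fillna(most_common)
print(most_common, int(latest_filled.isnull().sum()))
```

When two values are equally frequent, `[0]` silently picks the smaller one, so it can be worth inspecting the full `mode()` result before filling.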

### Exercise 4 - Fill missing temperature values with the median and set timestamp as the index

In [None]:
# Start again from df, so the 'latest' fill from Exercise 3 is not carried over
new_df = df.fillna({'temperature': df['temperature'].median()})
new_df = new_df.set_index('timestamp')
new_df.head()

timestamp,username,temperature,heartrate,build,latest,note
2017-01-01T12:00:23,michaelsmith,12.0,67.0,4e6a7805-8faa-2768-6ef6-eb3198b483ac,0.0,interval
2017-01-01T12:01:09,kharrison,6.0,78.0,7256b7b0-e502-f576-62ec-ed73533c9c84,0.0,wake
2017-01-01T12:01:34,smithadam,5.0,89.0,9226c94b-bb4b-a6c8-8e02-cb42b53e9c90,0.0,
2017-01-01T12:02:09,eddierodriguez,28.0,76.0,,0.0,update
2017-01-01T12:02:36,kenneth94,29.0,62.0,122f1c6a-403c-2221-6ed1-b5caa08f11e0,,
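
The index above still holds plain strings. As an optional variation, the sketch below applies the same median fill and also parses the timestamps with `pd.to_datetime`, which is an assumption about what is wanted here rather than part of the exercise. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the temperature values are made up
toy = pd.DataFrame({
    "timestamp": ["2017-01-01T12:00:23", "2017-01-01T12:01:09", "2017-01-01T12:01:34"],
    "temperature": [12.0, np.nan, 6.0],
})

# median() skips NaN by default, so it is computed from 12.0 and 6.0
temp_median = toy["temperature"].median()
toy["temperature"] = toy["temperature"].fillna(temp_median)

# Parse the ISO 8601 strings so the index becomes a DatetimeIndex, not text
toy["timestamp"] = pd.to_datetime(toy["timestamp"])
toy = toy.set_index("timestamp")
print(toy)
```

A `DatetimeIndex` allows time-based slicing and resampling, which a string index does not.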


### Exercise 5 - Fill null values in the build column using the uuid library
**Note: Each null value should be filled with a unique value**

In [None]:
import uuid
# This assigns a fresh UUID to every row, so existing build IDs are overwritten too
new_df['build'] = [uuid.uuid4() for _ in range(len(new_df.index))]

new_df['build'].duplicated().sum()

0
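
If existing build IDs should be preserved, only the null rows need new values. A minimal sketch of that variant, assuming string IDs and a hypothetical toy frame, could look like this:

```python
import uuid
import pandas as pd

# Hypothetical toy frame: one existing build ID and two gaps
toy = pd.DataFrame({
    "build": ["4e6a7805-8faa-2768-6ef6-eb3198b483ac", None, None],
})

# Boolean mask marking the rows where build is missing
missing = toy["build"].isnull()

# One fresh UUID string per missing row; existing IDs are left untouched
toy.loc[missing, "build"] = [str(uuid.uuid4()) for _ in range(int(missing.sum()))]

print(int(toy["build"].isnull().sum()), toy["build"].is_unique)
```

Storing `str(uuid.uuid4())` rather than the `UUID` object keeps the column's values the same type as the IDs read from the CSV.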

### Exercise 6 - Remove all rows with nulls

In [None]:
new_df.dropna(axis="rows")

timestamp,username,temperature,heartrate,build,latest,note
2017-01-01T12:00:23,michaelsmith,12.0,67.0,811a3c4a-487f-4be8-b3e0-244ebb593b90,0.0,interval
2017-01-01T12:01:09,kharrison,6.0,78.0,2e7d5de8-fb02-4f24-95e0-fe06f26d6587,0.0,wake
2017-01-01T12:02:09,eddierodriguez,28.0,76.0,3ca19d8f-2259-4339-8daf-3e0236c2946d,0.0,update
2017-01-01T12:03:04,bryanttodd,13.0,86.0,48518a15-9e5b-40ba-b79e-472afeb08a78,0.0,interval
2017-01-01T12:05:41,moorejeffrey,25.0,63.0,fcf374d9-b70b-4d69-a0f4-0c0dd8ca0ba9,0.0,wake
...,...,...,...,...,...,...
2017-01-30T07:03:13,nnelson,16.0,68.0,106f9451-4a06-47ed-b3e9-0eb9cd7202bb,0.0,user
2017-01-30T07:04:34,gnunez,17.0,74.0,f45fb923-7ee6-4687-81d4-3ee3edb205ce,0.0,interval
2017-01-30T07:04:55,astout,16.0,69.0,b2c1b75e-c1ad-4504-8732-6ae63911c6fb,0.0,wake
2017-01-30T07:05:39,agonzalez,17.0,63.0,7ab3cac0-6e19-4155-9b01-8ebcd2511d83,1.0,test
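
`dropna` returns a new DataFrame and leaves the original untouched unless the result is assigned. It also accepts parameters that make the drop less aggressive. The sketch below compares a few of them on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a different null pattern in each row
toy = pd.DataFrame({
    "temperature": [12.0, np.nan, 5.0, np.nan],
    "heartrate": [67.0, 78.0, np.nan, np.nan],
    "note": ["interval", "wake", None, None],
})

complete_rows = toy.dropna()                          # drop rows with any null
has_temperature = toy.dropna(subset=["temperature"])  # only check one column
not_all_empty = toy.dropna(how="all")                 # drop rows where every value is null
at_least_two = toy.dropna(thresh=2)                   # keep rows with >= 2 non-null values

print(len(complete_rows), len(has_temperature), len(not_all_empty), len(at_least_two))
```

Dropping every row that has any null can discard a large share of the data, so `subset` or `thresh` is often a gentler choice.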
