### Managing Nulls with Pandas

In this notebook, we will take a look at some ways to manage nulls using Pandas DataFrames.

For more details, check out the [pandas documentation on missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html).

In [None]:
import pandas as pd
from numpy import random

In [None]:
df = pd.read_csv('../data/iot_example_with_nulls.csv')

### Data Quality Check

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.note.value_counts()

### Let's mark all null values as missing (including the note: n/a)

In [None]:
df = pd.read_csv('../data/iot_example_with_nulls.csv', 
                 na_values=['n/a'])
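
As a rough illustration of what `na_values` does, here is a minimal sketch on an in-memory CSV. The column names and the `unknown` sentinel are hypothetical stand-ins, not taken from the IoT file. A custom sentinel is used because recent pandas versions may already treat `n/a` as missing by default.

```python
import io

import pandas as pd

# Hypothetical toy data; 'unknown' plays the role of a custom null marker.
csv_text = "username,note\nalice,unknown\nbob,interval\ncarol,unknown\n"

# Without na_values, the sentinel is kept as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, every 'unknown' cell is parsed as NaN.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=['unknown'])

print(raw.note.isnull().sum())
print(parsed.note.isnull().sum())
```

Comparing the two counts shows how many cells the extra sentinel turned into nulls.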

### Test to see if we can use dropna

In [None]:
df.shape

In [None]:
df.dropna().shape

In [None]:
df.dropna(how='all', axis=1).shape
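
To make the difference between the two `dropna` calls above concrete, the following sketch uses a small made-up DataFrame (the `empty` column is hypothetical and exists only to show the `how='all'` behavior).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: every row has at least one NaN,
# and one column is entirely NaN.
toy = pd.DataFrame({
    'temperature': [20.0, np.nan, 22.0],
    'latest': [1.0, 0.0, np.nan],
    'empty': [np.nan, np.nan, np.nan],
})

# Default dropna drops every row containing any NaN.
rows_complete = toy.dropna()

# how='all' with axis=1 drops only columns in which every value is NaN.
cols_not_empty = toy.dropna(how='all', axis=1)

print(rows_complete.shape)
print(cols_not_empty.shape)
```

If the default call leaves almost no rows, dropping by row is probably too aggressive for the dataset.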

### Test to see if we can drop columns

In [None]:
my_columns = list(df.columns)

In [None]:
my_columns

In [None]:
list(df.dropna(thresh=int(df.shape[0] * .9), axis=1).columns)
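
The `thresh` argument is the minimum number of non-null values a column needs in order to be kept. The sketch below illustrates the 90% threshold on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows.
toy = pd.DataFrame({
    'mostly_full': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
    'half_full': [1.0, np.nan] * 5,
})

# Keep only columns with at least 90% non-null values.
min_non_null = int(toy.shape[0] * .9)
kept = list(toy.dropna(thresh=min_non_null, axis=1).columns)

print(min_non_null)
print(kept)
```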

### I want to find all columns that have missing data

In [None]:
missing_info = list(df.columns[df.isnull().any()])

In [None]:
missing_info

In [None]:
for col in missing_info:
    # Summing the boolean mask counts the missing rows directly
    num_missing = df[col].isnull().sum()
    print('number missing for column {}: {}'.format(col, 
                                                    num_missing))

In [None]:
for col in missing_info:
    # The mean of the boolean mask is the fraction of missing rows;
    # the format spec renders that fraction as a percentage
    percent_missing = df[col].isnull().mean()
    print('percent missing for column {}: {:.2%}'.format(
        col, percent_missing))
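
The per-column loops above can also be expressed without a loop. This is a minimal sketch on made-up data, relying on the fact that `isnull()` returns booleans, so `sum()` gives counts and `mean()` gives fractions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only 'temperature' has missing values.
toy = pd.DataFrame({
    'username': ['a', 'b', 'c', 'd'],
    'temperature': [20.0, np.nan, np.nan, 23.0],
})

# One value per column: number and fraction of missing rows.
missing_counts = toy.isnull().sum()
missing_fraction = toy.isnull().mean()

print(missing_counts)
print(missing_fraction)
```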

### Can I easily substitute majority values in for missing data?

In [None]:
df.note.value_counts()

In [None]:
df.build.value_counts().head()

In [None]:
df.latest.value_counts()

In [None]:
df.latest = df.latest.fillna(0)
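
Rather than hard-coding the fill value, one option is to look up the majority value programmatically. The sketch below assumes a made-up `latest` column and is only one possible way to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column where 0.0 is the most common value.
toy = pd.DataFrame({'latest': [0.0, 0.0, 1.0, np.nan, 0.0, np.nan]})

# value_counts ignores NaN by default; idxmax returns the most frequent value.
majority = toy.latest.value_counts().idxmax()

# Fill the gaps with that majority value.
toy.latest = toy.latest.fillna(majority)

print(majority)
print(toy.latest.isnull().sum())
```

This only makes sense when one value clearly dominates; otherwise the fill can distort the distribution.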

### We have not yet addressed the missing temperature values. Let's find a way to fill them

In [None]:
df.username.value_counts().head()

In [None]:
df = df.set_index('timestamp')

In [None]:
df.head()

In [None]:
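# Backfill up to 3 consecutive missing readings within each user's series.
# bfill replaces fillna(method='backfill'), which is deprecated in recent pandas.
df.temperature = df.groupby('username').temperature.bfill(limit=3)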
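
To see how a grouped backfill with a limit behaves, here is a minimal sketch on made-up readings for two hypothetical users. It uses `limit=1` so the effect of the limit is visible on a tiny frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings: user 'a' has a later value to copy back,
# user 'b' has no valid reading at all.
toy = pd.DataFrame({
    'username': ['a', 'a', 'a', 'b', 'b'],
    'temperature': [np.nan, np.nan, 21.0, np.nan, np.nan],
})

# Within each user, copy the next valid reading backwards over
# at most one consecutive missing value.
filled = toy.groupby('username').temperature.bfill(limit=1)

print(filled)
```

Grouping keeps one user's readings from leaking into another user's gaps, and the limit leaves long runs of missing values only partly filled.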

### Exercise: How many temperature values did I fill? What percentage of values are still missing (for temperature)?

In [None]:
# %load ../solutions/nulls.py


In [None]:
rows_filled

In [None]:
still_missing